Understanding Satirical Articles Using Common-Sense

Dan Goldwasser, Purdue University, Department of Computer Science, dgoldwas@purdue.edu
Xiao Zhang, Purdue University, Department of Computer Science, zhang923@purdue.edu

Abstract

Automatic satire detection is a subtle text classification task, for machines and, at times, even for humans. In this paper we argue that satire detection should be approached using common-sense inferences, rather than traditional text classification methods. We present a highly structured latent variable model capturing the required inferences. The model abstracts over the specific entities appearing in the articles, grouping them into generalized categories, thus allowing the model to adapt to previously unseen situations.

1 Introduction

Satire is a writing technique for passing criticism using humor, irony or exaggeration. It is often used in contemporary politics to ridicule individual politicians, political parties or society as a whole. We restrict ourselves in this paper to such political satire articles, broadly defined as articles whose purpose is not to report real events, but rather to mock their subject matter. Satirical writing often builds on real facts and expectations, pushed to absurdity to express humorous insights about the situation. As a result, the difference between real and satirical articles can be subtle and often confusing to readers. With the recent rise of social media outlets, satirical articles have become increasingly popular and have famously fooled several leading news agencies.1 These misinterpretations can often be attributed to careless reading, as there is a clear line between unusual events finding their way to the news and satire, which intentionally places key political figures in unlikely humorous scenarios. The two can be separated by carefully reading the articles, exposing the satirical nature of the events described in them.

1 https://newrepublic.com/article/118013/satire-news-websites-are-cashing-gullible-outraged-readers

  Vice President Joe Biden suddenly barged in, asking if anyone could "hook [him] up with a Dixie cup" of their urine. "C'mon, you gotta help me get some clean whiz. Shinseki, Donovan, I'm looking in your direction," said Biden.

  "Do you want to hit this?" a man asked President Barack Obama in a bar in Denver Tuesday night. The president laughed but didn't indulge. It wasn't the only time Obama was offered weed on his night out.

Figure 1: Examples of real and satirical articles. Top: satirical news excerpt. Bottom: real news excerpt.

In this paper we follow this intuition. We look into the satire detection task (Burfoot and Baldwin, 2009), predicting if a given news article is real or satirical, and suggest that this prediction task should be defined over common-sense inferences, rather than looking at it as a lexical text classification task (Pang and Lee, 2008; Burfoot and Baldwin, 2009), which bases the decision on word-level features.

To further motivate this observation, consider the two excerpts in Figure 1. Both excerpts mention top-ranking politicians (the President and Vice President) in a drug-related context, and contain informal slang utterances, inappropriate for the subjects'
position. The difference between the two examples becomes apparent when analyzing the situations described in the two articles: the first example (top) describes the Vice President speaking inappropriately in a work setting, clearly an unrealistic situation. In the second (bottom), the President is spoken to inappropriately, an unlikely, yet not unrealistic, situation. From the perspective of our prediction task, it is advisable to base the prediction on a structured representation capturing the events and participants described in the text.

The absurdity of the situation described in satirical articles is often not unique to the specific individuals appearing in the narrative. In our example, both politicians are interchangeable: placing the President in the situation described in the first excerpt would not make it less absurd. It is therefore desirable to make a common-sense inference about high-ranking politicians in this scenario.

We follow these intuitions and suggest a novel approach for the satire prediction task. Our model, COMSENSE, makes predictions by making common-sense inferences over a simplified narrative representation. Similarly to prior work (Chambers and Jurafsky, 2008; Goyal et al., 2010; Wang and McAllester, 2015), we represent the narrative structure by capturing the main entities (and tracking their mentions throughout the text), their activities, and their utterances. The result of this process is a Narrative Representation Graph (NRG). Figure 2 depicts examples of this representation for the excerpts in Figure 1.

Given an NRG, our model makes inferences quantifying how likely each of the represented events and interactions is to appear in a real or a satirical context. Annotating the NRG for such inferences is a challenging task, as the space of possible situations is extremely large. Instead, we frame the required inferences as a highly structured latent variable model, trained discriminatively as part of the prediction task. Without explicit supervision, the model assigns categories to the NRG vertices (for example, by grouping politicians into a single category, or by grouping inappropriate slang utterances, regardless of specific word choice). These category assignments form the infrastructure for higher-level reasoning, as they allow the model to identify the commonalities between unrelated people, their actions and their words. The model learns common-sense patterns leading to real or satirical decisions based on these categories. We express these patterns as parameterized rules (acting as global features in the prediction model), and base the prediction on their activation values. In our example, these rules can capture combinations such as $(E_{Politician}) \wedge (Q_{slang}) \rightarrow \text{SATIRE}$, where $E_{Politician}$ and $Q_{slang}$ are latent variable assignments to entity and utterance categories, respectively.

Our experiments look into two variants of satire prediction: using full articles, and the more challenging sub-task of predicting if a quote is real given its speaker. We use two datasets collected 6 years apart: the first was collected in 2009 (Burfoot and Baldwin, 2009), and we collected an additional dataset recently. Since satirical articles tend to focus on current events, the two datasets describe different people and world events. To demonstrate the robustness of our COMSENSE approach we use the first dataset for training, and the second as out-of-domain test data.
We compare COMSENSE to several competing systems, including a state-of-the-art Convolutional Neural Network (Kim, 2014). Our experiments show that COMSENSE outperforms all other models. Most interestingly, it does so by a larger margin when tested over the out-of-domain dataset, demonstrating that it is more resistant to overfitting than the other models.

2 Related Work

The problem of building computational models dealing with humor, satire, irony and sarcasm has attracted considerable interest in the Natural Language Processing (NLP) and Machine Learning (ML) communities in recent years (Wallace et al., 2014; Riloff et al., 2013; Wallace et al., 2015; Davidov et al., 2010; Karoui et al., 2015; Burfoot and Baldwin, 2009; Tepperman et al., 2006; González-Ibánez et al., 2011; Lukin and Walker, 2013; Filatova, 2012; Reyes et al., 2013). Most work has looked into ironic expressions in shorter texts, such as tweets and forum comments. Most related to our work is Burfoot and Baldwin (2009), which focused on satirical articles. In that work the authors suggest a text classification approach for satire detection. In addition to using bag-of-words features, the authors also experiment with semantic validity features, which pair entities mentioned in the article, thus capturing combinations unlikely to appear in a real context. This paper follows a similar intuition; however, it looks into structured representations of this information, and studies their advantages.

Our structured representation is related to several recent reading comprehension tasks (Richardson et al., 2013; Berant et al., 2014) and to work on narrative representation, such as event chains (Chambers and Jurafsky, 2009; Chambers and Jurafsky, 2008), plot units (Goyal et al., 2010; Lehnert, 1981) and Story Intention Graphs (Elson, 2012). Unlike these works, narrative representation is not the focus of this work, but rather provides the basis for making inferences, and as a result we choose a simpler (and more robust) representation, most closely resembling event chains (Chambers and Jurafsky, 2008).

Making common-sense inferences is one of the core missions of AI, applicable to a wide range of tasks. Early work (Reiter, 1980; McCarthy, 1980; Hobbs et al., 1988) focused on logical inference and on the manual construction of common-sense knowledge repositories (Lenat, 1995; Liu and Singh, 2004). More recently, several researchers have looked into automatic common-sense knowledge construction and expansion using common-sense inferences (Tandon et al., 2011; Bordes et al., 2011; Socher et al., 2013; Angeli and Manning, 2014). Several works have looked into combining NLP with common-sense (Gerber et al., 2010; Gordon et al., 2011; LoBue and Yates, 2011; Labutov and Lipson, 2012; Gordon et al., 2012). Most relevant to our work is a SemEval-2012 task (Gordon et al., 2012) looking into common-sense causality identification.

In this work we focus on a different task, satire detection in news articles. We argue that this task is inherently a common-sense reasoning task, as identifying the satirical aspects of narrative text does not require any specialized training, but instead relies heavily on common expectations of normative behavior and on deviations from it in satirical text. We design our model to capture these behavioral expectations using (weighted) rules, instead of relying on lexical features as is often the case in text categorization tasks.
Other common-sense frameworks typically build on existing knowledge bases representing world knowledge; however, specifying in advance the behaviors commonly associated with people based on their background and situational context, to the extent that it can provide good coverage for our task, requires considerable effort. Instead, we suggest learning this information from data directly, and our model learns jointly to predict and to represent the satirical elements of the article.

Figure 2: Narrative Representation Graph (NRG) for two article snippets. (a) NRG for the satirical excerpt, attaching the quote and the predicates "barge" (with the modifier "suddenly") and "ask" to the entity Vice President Joe Biden. (b) NRG for the real excerpt, attaching the quote to the entity "a man", and the predicates "laugh" and (negated) "indulge", with temporal ("Tuesday night") and location ("bar in Denver") modifiers, to the coreferent mentions President Barack Obama / the President.

3 Modeling

Given a news article, our COMSENSE system first constructs a graph-based representation of the narrative, denoted Narrative Representation Graph (NRG), capturing its participants, their actions and their utterances. We describe this process in more detail in Section 3.1. Based on the NRG, our model makes a set of inferences, mapping the NRG vertices to general categories abstracting over the specific NRG. These abstractions are formulated as latent variables in our model. The system makes a prediction by reasoning over the abstract NRG, decomposing it into paths, where each path captures a partial view of the abstract NRG. Finally, we associate the paths with the satire decision output. The COMSENSE model then solves a global inference problem, formulated as an Integer Linear Program (ILP) instance, looking for the most likely explanation of the satire prediction output, consistent with the extracted patterns. We explain this process in detail in Section 3.2.

NRG Abstraction as Common-Sense
The main goal of the COMSENSE approach is to move away from purely lexical models, and instead base its decisions on common-sense inferences. We formulate these inferences as parameterized rules, mapping elements of the narrative, represented using the NRG, to a classification decision. The rules' ability to capture common-sense inferences hinges on two key elements. First, the abstraction of NRG nodes into typed narrative elements allows the model to find commonalities across entities and their actions. This is done by associating each NRG node with a set of latent variables. Second, constructing the decision rules according to the structure of the NRG allows us to model the dependencies between narrative elements. This is done by following the paths in the abstract NRG, generating rules by combining the latent variables representing nodes on the path, and associating them with a satire decision variable.

Computational Considerations
When setting up the learning system, there is a clear expressivity/efficiency tradeoff over these two elements. Increasing the number of latent variables associated with each NRG node would allow the model to learn a more nuanced representation. Similarly, generating rules by following longer NRG paths would allow the model to condition its satire decision on multiple entities and events jointly. The added expressivity does not come without a price.
Given the limited supervision afforded to the model when learning these rules, additional expressivity would result in a more difficult learning problem, which could lead to overfitting. Our experiments demonstrate this tradeoff, and in Figure 4 we show the effect of increasing the number of latent variables on performance. An additional concern with increasing the model's expressivity is computational efficiency. Satire prediction is formulated as an ILP inference process, jointly assigning values to the latent variables and making the satire decision. Since ILP is exponential in the number of variables, increasing the number of latent variables would be computationally challenging. In this paper we take a straightforward approach to ensuring computational tractability, limiting the length of the NRG paths considered by our model to a constant size $c = 2$. Assuming that we have $m$ latent categories associated with each node, each path would generate $m^c$ ILP variables (see Section 3.3 for details), hence the importance of limiting the length of the path. In the future we intend to study approximate inference methods that can help alleviate this computational difficulty, such as LP approximations (Martins et al., 2009).

3.1 Narrative Representation Graph for News Articles

The Narrative Representation Graph (NRG) is a simple graph-based representation for narrative text, describing the connections between entities and their actions. The key motivation behind the NRG was to provide the structure necessary for making inferences, and as a result we chose a simple representation that does not take into account cross-event relationships, or nuanced differences between some of the event argument types. While other representations (Mani, 2012; Goyal et al., 2010; Elson, 2012) capture more information, they are harder to construct and more prone to error. We will look into adapting these models for our purpose in future work.

Since satirical articles tend to focus on political figures, we design the NRG around the animate entities that drive the events described in the text, their actions (represented as predicate nodes), their contextualizing information (location modifiers, temporal modifiers, negations), and their utterances. We omitted other, non-animate entity types from the graph. In Figure 2 we show an example of this representation. Similar in spirit to previous work (Goyal et al., 2010; Chambers and Jurafsky, 2008), we represent the relations between the entities that appear in the story using a Semantic Role Labeling system (Punyakanok et al., 2008) and collapse all the mentions of an entity into a single entity using a Co-Reference resolution system (Manning et al., 2014). We attribute utterances to their speakers based on a previously published rule-based system (O'Keefe et al., 2012).

Formally, we construct a graph $G = \{V, E\}$, where $V$ consists of three types of vertices: ANIMATE ENTITY (e.g., people), PREDICATE (e.g., actions) and ARGUMENT (e.g., utterances, locations). The edges $E$ capture the relationships between vertices. The graph contains several different edge types. COREF edges collapse the mentions of the same entity into a single entity; ARGUMENT-TYPE edges connect ANIMATE ENTITY nodes to PREDICATE nodes,2 and PREDICATE nodes to argument nodes (modifiers). Finally, we add QUOTE edges connecting ANIMATE ENTITY nodes to utterances (ARGUMENT).

2 These edges are typed according to their semantic roles.
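To make the representation concrete, the following is a minimal Python sketch of the NRG as a typed graph. The class and field names (Vertex, Edge, NRG) are illustrative choices of ours, not part of the described system; an actual implementation would populate the graph from the SRL, coreference, and quote-attribution outputs described above.

```python
from dataclasses import dataclass, field

VERTEX_TYPES = {"ANIMATE_ENTITY", "PREDICATE", "ARGUMENT"}

@dataclass
class Vertex:
    vid: int
    vtype: str        # one of VERTEX_TYPES
    words: list       # surface words covered by this node

@dataclass
class Edge:
    src: int          # source vertex id
    dst: int          # target vertex id
    etype: str        # COREF, QUOTE, or a semantic role such as "A0"

@dataclass
class NRG:
    vertices: dict = field(default_factory=dict)   # vid -> Vertex
    edges: list = field(default_factory=list)

    def add_vertex(self, vid, vtype, words):
        assert vtype in VERTEX_TYPES
        self.vertices[vid] = Vertex(vid, vtype, words)

    def add_edge(self, src, dst, etype):
        self.edges.append(Edge(src, dst, etype))

# The real-news excerpt of Figure 2(b), reduced to three nodes:
g = NRG()
g.add_vertex(0, "ANIMATE_ENTITY", ["a", "man"])
g.add_vertex(1, "PREDICATE", ["ask"])
g.add_vertex(2, "ARGUMENT", ["Do", "you", "want", "to", "hit", "this", "?"])
g.add_edge(0, 1, "A0")      # "a man" is the agent of "ask"
g.add_edge(0, 2, "QUOTE")   # the utterance is attributed to "a man"
```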
3.2 Satire Prediction using the Narrative Representation Graph

Satire prediction is inherently a text classification problem. Such problems are often approached using a Bag-of-Words (BoW) model, which ignores the document structure when making predictions. Instead, the NRG provides a structured representation for making the satire prediction. We begin by showing how the NRG can be used directly, and then discuss how to enhance it by mapping the graph into abstract categories.

Directly Using NRG for Satire Prediction
We suggest a simple approach for extracting features directly from the NRG, by decomposing it into graph paths, without mapping the graph into abstract categories. This simple, word-based representation for prediction, structured according to the NRG (denoted NARRLEX), generates features using the words in the original document that correspond to the graph decomposition. For example, consider the path connecting "a man" to an utterance in Figure 2(b). Simple features could associate the utterance's words with that entity, rather than with the President. The resulting NARRLEX model generates Bag-of-Words features based on the words corresponding to NRG path vertices, conditioned on their connected entity vertex.

Using Common-Sense for Satire Prediction
Unlike the NARRLEX model, which relies on directly observed information, our COMSENSE model performs inference over higher-level patterns. In this model the prediction is a global inference process, taking into account the relationships between NRG elements (and their abstraction into categories) and the final prediction. This process is described in Figure 3.

First, the model associates a high-level category with each NRG vertex, one that can be reused even when other, previously unseen, entities are discussed in the text. We associate a set of Boolean variables with each NRG vertex, capturing higher-level abstractions over that node. We define three types of categories, corresponding to the three types of vertices, and denote them $E$, $A$, $Q$ for Entity category, Action category and Quote category, respectively. Each category variable can take $k$ different values. As a convention, we denote a category assignment by $X = i$, where $X \in \{E, A, Q\}$ is the category type and $i$ is its assignment. Since these category assignments are not directly observed, they are treated as latent variables in our model. This process is exemplified at the top right corner of Figure 3.

Combinations of category assignments form the patterns used for determining the prediction. These patterns can be viewed as parameterized rules. Each weighted rule associates a combination with an output variable (SATIRE or REAL). Examples of such rules are provided in the middle of the right side of Figure 3. We formulate the activations of these rules as Boolean variables, whose assignments are highly interconnected. For example, the variables representing the rules $(E = 0) \rightarrow \text{SATIRE}$ and $(E = 0) \rightarrow \text{REAL}$ are mutually exclusive, since assigning a true value to either one entails a satire (or real) prediction. To account for this interdependency, we add constraints capturing the relations between rules. The model makes predictions by combining the rule weights and predicting the top-scoring output value.
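To illustrate the rule space just described, the sketch below enumerates the candidate rules induced by a single NRG path. It reuses the hypothetical Vertex objects and graph g from the sketch in Section 3.1; the helper name candidate_rules is ours, and the actual model scores these combinations with learned weights rather than materializing them as tuples.

```python
from itertools import product

# Map NRG vertex types to category types (E, A, Q)
CATEGORY_TYPE = {"ANIMATE_ENTITY": "E", "PREDICATE": "A", "ARGUMENT": "Q"}
OUTPUTS = ("SATIRE", "REAL")

def candidate_rules(path_vertices, k):
    """Enumerate (body, output) rule candidates for one NRG path.

    path_vertices: the Vertex objects on a path of length <= 2;
    k: the number of latent categories per category type.
    """
    types = [CATEGORY_TYPE[v.vtype] for v in path_vertices]
    rules = []
    # every joint category assignment to the path's vertices ...
    for assignment in product(range(k), repeat=len(types)):
        body = tuple(f"{t}={i}" for t, i in zip(types, assignment))
        # ... paired with each output value yields one weighted rule
        for out in OUTPUTS:
            rules.append((body, out))
    return rules

# For the entity-quote path of Figure 3 with k = 2, this yields rules
# such as (("E=0", "Q=1"), "SATIRE") and (("E=0", "Q=1"), "REAL").
print(candidate_rules([g.vertices[0], g.vertices[2]], k=2))
```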
The prediction can be viewed as a derivation process: article entities are mapped to categories (e.g., ENTITY("A MAN") $\rightarrow$ (E = 0) is an example of such a derivation), and combinations of categories compose into prediction patterns (e.g., (E = 0) $\rightarrow$ SATIRE). We use an ILP solver to find the optimal derivation sequence. We describe the inference process as an Integer Linear Program in the following section.

Figure 3: Extracting common-sense prediction rules: an NRG fragment for the real excerpt, its latent category assignments (e.g., ENTITY("a man") $\rightarrow$ (E = 0), ENTITY("President Barack Obama") $\rightarrow$ (E = 1), PREDICATE("laugh") $\rightarrow$ (A = 0), QUOTE("Do you want to hit this?") $\rightarrow$ (Q = 1)), and the competing prediction rules built from them (e.g., (E = 0) $\rightarrow$ SATIRE, (E = 0) $\wedge$ (Q = 1) $\rightarrow$ REAL).

3.3 Identifying Relevant Interactions using Constrained Optimization

We formulate the decision as a 0-1 Integer Linear Programming problem, consisting of three types of Boolean variables: category assignment indicator variables, indicator variables for common-sense patterns, and finally the output decision variables. Each indicator variable is also represented using a feature set, used to score its activation.

3.3.1 Category Assignment Variables

Each node in the NRG is assigned a set of competing variables, mapping the node to different categories according to its type.

• ANIMATE ENTITY category variables, denoted $h_{i,j,E}$, indicating Entity category $i$ for NRG vertex $j$.
• ACTION category variables, denoted $h_{i,j,A}$, indicating Action category $i$ for NRG vertex $j$.
• QUOTE category variables, denoted $h_{i,j,Q}$, indicating Quote category $i$ for NRG vertex $j$.

The number of possible categories for each variable type is a hyper-parameter of the model.

Variable activation constraints
Category assignments to the same node are mutually exclusive (a node can only have a single category). We encode this fact by constraining the decision with a linear constraint (where $X \in \{E, A, Q\}$):

$$\forall j: \; \sum_i h_{i,j,X} = 1.$$

Category Assignment Features
Each decision variable decomposes into a set of features, $\phi(x, h_{i,j,X})$, capturing the words associated with the $j$-th vertex, conditioned on $X$ and $i$.

3.3.2 Common-Sense Pattern Variables

We represent common-sense prediction rules using an additional set of Boolean variables, connecting the category assignment variables with the output prediction. The space of possible variables is determined by decomposing the NRG into paths of size up to 2, and associating two Boolean variables with the category assignment variables corresponding to the vertices on these paths. One of the variables associates the sequence of category assignment variables with a REAL output value, and the other with a SATIRE output value.

• Single Vertex Path Pattern Variables, denoted $h^B_{h_{i,j,X}}$, indicating that the category assignment captured by $h_{i,j,X}$ is associated with output value $B$ (where $B \in \{\text{SATIRE}, \text{REAL}\}$).
• Two Vertex Path Pattern Variables, denoted $h^B_{(h_{i,j,X_1}),(h_{k,l,X_2})}$, indicating that the pattern captured by the category assignments along the NRG path of $h_{i,j,X_1}$ and $h_{k,l,X_2}$ is associated with output value $B$ (where $B \in \{\text{SATIRE}, \text{REAL}\}$).
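As a minimal sketch of this variable space, the snippet below creates the category assignment variables and their mutual-exclusion constraints with the Gurobi Python API (the solver used later in this section). The function name and the reuse of the earlier NRG sketch are our own assumptions; the pattern and output variables, together with the consistency constraints described next, would be added to the same model.

```python
import gurobipy as gp
from gurobipy import GRB

def add_category_variables(model, nrg, k):
    """One binary variable h[i, j, X] per category i, vertex j, type X,
    plus the mutual-exclusion constraint sum_i h[i, j, X] = 1."""
    h = {}
    for j, v in nrg.vertices.items():
        X = CATEGORY_TYPE[v.vtype]   # "E", "A" or "Q"
        for i in range(k):
            h[i, j, X] = model.addVar(vtype=GRB.BINARY,
                                      name=f"h_{i}_{j}_{X}")
        # each vertex takes exactly one category
        model.addConstr(gp.quicksum(h[i, j, X] for i in range(k)) == 1)
    return h

model = gp.Model("comsense")
h = add_category_variables(model, g, k=2)
```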
Decision Consistency constraints
It is clear that the activation of a common-sense pattern variable entails the activation of the category assignment variables corresponding to the elements of the common-sense pattern. For readability we only write the constraint for the Single Vertex Path Variables:

$$h^B_{h_{i,j,X}} \implies h_{i,j,X}.$$

Features
Similar to the category assignment variable features, each decision variable decomposes into a set of features, $\phi(x, h^B_{h_{i,j,X}})$. These features capture the words associated with each of the category assignment variables (in this example, the words associated with the $j$-th vertex), conditioned on the category assignments and the output prediction value (in this example, $X$, $i$ and $B$). We also add a feature $\phi(h_{i,j,X}, B)$ capturing the connection between the output value $B$ and the category assignment.

3.3.3 Satire Prediction Variables

Finally, we add two more Boolean variables corresponding to the output prediction: $h^{\text{Satire}}$ and $h^{\text{Real}}$. The activation of these two variables is mutually exclusive; we encode that by adding the constraint: $h^{\text{Satire}} + h^{\text{Real}} = 1$. We ensure the consistency of our model by adding constraints forcing agreement between the final prediction variables and the common-sense pattern variables: $h^B_{h_{i,j,X}} \implies h^B$.

Overall Optimization Function
The Boolean variables described in the previous sections define a space of competing inferences. We find the optimal output value derivation by finding the optimal set of variable assignments, solving the following objective:

$$\max_{y,h} \sum_i h_i \, w^T \phi(x, h_i, y) \quad \text{s.t. } C, \; \forall i: h_i \in \{0, 1\}, \qquad (1)$$

where $h_i \in H$ ranges over the set of all variables defined above and $C$ is the set of constraints defined over the activation of these variables. $w$ is the weight vector, used to quantify the feature representation of each $h$, obtained using a feature function $\phi(\cdot)$. Note that each Boolean variable acts as a 0-1 indicator variable. We formalize Eq. (1) as an ILP instance, which we solve using the highly optimized Gurobi toolkit.3

3 http://www.gurobi.com/

4 Parameter Estimation for COMSENSE

The COMSENSE approach models the decision as interactions between high-level categories of entities, actions and utterances. However, the high-level categories assigned to the NRG vertices are not observed, and as a result we view this as a weakly supervised learning problem, where the category assignments correspond to latent variable assignments. We learn the parameters of these assignments using a discriminative latent structure learning framework.

The training data is a collection $D = \{(x_i, y_i)\}_{i=1}^n$, where $x_i$ is an article, parsed into an NRG representation, and $y_i$ is a binary label, indicating if the article is satirical or real. Given this data we estimate the model's parameters by minimizing the following objective function:

$$L_D(w) = \min_w \frac{\lambda}{2} ||w||^2 + \frac{1}{n} \sum_{i=1}^n \xi_i \qquad (2)$$

$\xi_i$ is the slack variable, capturing the margin violation penalty for a given training example, and is defined as follows:

$$\xi_i = \max_{y,h} \left[ f(x, h, y, w) + \text{cost}(y, y_i) \right] - \max_h f(x, h, y_i, w),$$

where $f(\cdot)$ is a scoring function, similar to the one used in Eq. (1). The cost function is the margin that the true prediction must exceed over the competing label, and it is simply defined as the difference between the model prediction and the gold label. This formulation is an extension of the hinge loss for latent structure SVM. $\lambda$ is the regularization parameter controlling the tradeoff between the $\ell_2$ regularizer and the slack penalty.
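To make the loss concrete, here is a sketch of computing the slack $\xi_i$ for a single example with two calls to the inference procedure. The wrappers solve_ilp and solve_ilp_fixed_label are hypothetical, standing for the Eq. (1) solver augmented with the cost term and for the same solver with the output variable clamped to the gold label; neither is a function the paper defines.

```python
def example_slack(w, x_i, y_i, solve_ilp, solve_ilp_fixed_label):
    """Slack for one training example, per the hinge-loss definition above."""
    # Loss-augmented inference: max over (y, h) of f(x, h, y, w) + cost(y, y_i)
    y_star, h_star, aug_score = solve_ilp(w, x_i, cost_against=y_i)
    # Best explanation of the gold label: max over h of f(x, h, y_i, w)
    h_gold, gold_score = solve_ilp_fixed_label(w, x_i, y_i)
    # Non-negative by construction: y = y_i with cost 0 is one candidate
    # in the first maximization, so aug_score >= gold_score.
    return aug_score - gold_score, (y_star, h_star, h_gold)
```

The two solutions also supply the feature vectors $\phi(x_i, h^*, y^*)$ and $\phi(x_i, h^*, y_i)$ used in the sub-gradient update described next.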
We optimize this objective using the stochastic sub-gradient descent algorithm (Ratliff et al., 2007; Felzenszwalb et al., 2009). We can compute the sub-gradient as follows:

$$\nabla L_D(w) = \lambda w + \sum_{i=1}^n \Phi(x_i, y_i, y^*)$$
$$\Phi(x_i, y_i, y^*) = \phi(x_i, h^*, y_i) - \phi(x_i, h^*, y^*),$$

where $\phi(x_i, h^*, y^*)$ is the set of features representing the solution obtained after solving Eq. (1)4 and making a prediction, and $\phi(x_i, h^*, y_i)$ is the set of features representing the solution obtained by solving Eq. (1) while fixing the outcome of the inference process to the correct prediction (i.e., $y_i$). Intuitively, the latter can be considered as finding the best explanation for the correct label using the latent variables $h$.

4 Modified to accommodate the margin constraint.

In the stochastic version of the sub-gradient descent algorithm we approximate $\nabla L_D(w)$ by computing the sub-gradient of a single example and making a local update. This version resembles the latent-structure perceptron algorithm (Sun et al., 2009). We repeatedly iterate over the training examples; for each example, if the current $w$ leads to a correct prediction (and satisfies the margin constraint), we only shrink $w$ according to $\lambda$. If the model makes an incorrect prediction, the model is updated according to $\Phi(x_i, y_i, y^*)$. The optimization objective $L_D(w)$ is not convex, and the optimization procedure is only guaranteed to converge to a local minimum.

5 Empirical Study

We design our experimental evaluation to help clarify several questions. First, we want to understand how our model compares with traditional text classification models. We hypothesize that these methods are more susceptible to overfitting, and design our experiments accordingly. We compare the models' performance when using in-domain data (the test and training data are from the same source) and out-of-domain data, where the test data is collected from a different source. We look into two tasks. One is the satire detection task (Burfoot and Baldwin, 2009). We also introduce a new task, called "did I say that?", which only focuses on utterances and speakers.

The second aspect of our evaluation focuses on the common-sense inferences learned by our model. We examine how the size of the set of categories impacts the model performance. We also provide a qualitative analysis of the learned categories using a heat map, capturing the activation strength of learned inferences over the training data.

Prediction tasks
We look into two prediction tasks: (1) Satire Detection (denoted SD), a binary classification task in which the model has access to the complete article; (2) "Did I say that?" (denoted DIST), a binary classification task consisting only of entity mentions (and their surrounding context in the text) and direct quotes. The goal of DIST is to predict if a given utterance is likely to be real, given its speaker. Since not all documents contain direct quotes, we only use a subset of the documents used in the SD task.

Datasets
In both prediction tasks we look into two settings: (1) in-domain prediction, where the training and test data are collected from the same source, and (2) out-of-domain prediction, where the test data is collected from a different source. We use the data collected by Burfoot and Baldwin (2009) for training the model in both settings, and its test data for in-domain prediction (denoted TRAIN - SD'09, TEST - SD'09, TRAIN - SD'09 - DIST, TEST - SD'09 - DIST, respectively, for training and testing in the SD and DIST tasks).
In addition, we collected a second dataset of satirical and real articles (denoted SD'16). This collection contains real articles from cnn.com and satirical articles from theonion.com, a well-known satirical news website. The articles were published between 2010 and 2015, appearing in the political sections of both news websites. Following other work in the field, all datasets are highly skewed toward the negative class (real articles), as this better characterizes a realistic prediction scenario. The statistics of the datasets are summarized in Table 2.

Evaluated Systems
We compare several systems, as follows:

System      Description
ALLPOS      Always predict Satire.
BB'09       Results by Burfoot and Baldwin (2009).
CONV        Convolutional NN. We followed Kim (2014), using pre-trained
            300-dimensional word vectors (Mikolov et al., 2013).
LEX         SVM with unigram (LEX_U) or both unigram and bigram (LEX_U+B)
            features.
NARRLEX     SVM with direct NRG-based features (see Section 3.2).
COMSENSE    Our model. We denote the full model as COMSENSE_F, and
            COMSENSE_Q when using only the entity+quote based patterns.

We tuned all the models' hyper-parameters using a small validation set, consisting of 15% of the training data. After setting the hyper-parameters, each model was retrained using the entire dataset. We used SVM-light5 to train our lexical baseline systems (LEX and NARRLEX). Since the data is highly skewed towards the negative class (REAL), we adjust the learner's objective-function cost factor so that positive examples outweigh negative examples. The cost factor was tuned using the validation set.

5 http://svmlight.joachims.org/

5.1 Experimental Results

Since our goal is to identify satirical articles, given significantly more real articles, we report the F-measure of the positive class. The results are summarized in Tables 1 and 3. We can see that in all cases the COMSENSE model obtains the best results. We note that in both tasks, performance drops sharply in the out-of-domain settings; however, the gap between the COMSENSE model and the other models increases in these settings, showing that it is less prone to overfitting.

Interestingly, for the satire detection (SD) task, the COMSENSE_Q model performs best in the in-domain setting, while COMSENSE_F gives the best performance in the out-of-domain setting. We hypothesize that this is due to a phenomenon we call "overfitting to document structure". Lexical models tend to base the decision on word choices specific to the training data, and as a result, when tested on out-of-domain data, which describes new events and entities, their performance drops sharply. Instead, the COMSENSE_Q model focuses on properties of the quotations and entities appearing in the text. In the SD'09 datasets, this information helps focus the learner, as the real and satire articles are structured differently (for example, satire articles frequently contain multiple quotes). This structure is not maintained in the out-of-domain data, and indeed in these settings the model benefits from the additional information offered by the full model.

Number of Latent Categories
Our COMSENSE model is parametrized with the number of latent categories it considers for each entity, predicate and quote. This hyper-parameter can have a strong influence on the model performance (and running time).
Increasing it adds to the model's expressivity, allowing it to learn more complex patterns, but also defines a more complex learning problem (recall our non-convex learning objective function). We focused on the DIST task when evaluating different configurations, as it converged much faster than the full model.

Figure 4: F-score on the DIST task for different numbers of latent categories, plotted against the number of quote categories (2-6) for the settings EV=1, EV=2, EV=3 and the Lex baseline. EV denotes the number of entity categories used, and Quote Vars denotes the number of quote categories used.

Figure 4 plots the model behavior when using different numbers of latent categories. Interestingly, the number of entity categories saturates faster than the number of quote categories. This can be attributed to the limited text describing entities.

Visualizing Latent COMSENSE Patterns
Given the assignment to latent categories, our model learns common-sense patterns for identifying satirical and real articles based on these categories. Ideally, these patterns could be extracted directly from the data; however, providing the resources for this additional prediction task is not straightforward. Instead, we view the category assignments as latent variables, which raises the question: what are the categories learned by the model? In this section we provide a qualitative evaluation of these categories and of the prediction rules identified by the system, using the heat map in Figure 5. For simplicity, we focus on the DIST task, which only has categories corresponding to entities and quotes.

Task: SD       INDOMAIN (SD'09+SD'09)   OUTDOMAIN (SD'09+SD'16)
               P      R      F          P      R      F
ALLPOS         0.063  1      0.118      0.121  1      0.214
BB'09          0.945  0.690  0.798      -      -      -
CONV           0.822  0.531  0.614      0.517  0.310  0.452
LEX_U          0.920  0.690  0.790      0.298  0.579  0.394
LEX_U+B        0.840  0.720  0.775      0.347  0.367  0.356
NARRLEX        0.690  0.590  0.630      0.271  0.425  0.330
COMSENSE_Q     0.839  0.780  0.808      0.317  0.706  0.438
COMSENSE_F     0.853  0.700  0.770      0.386  0.693  0.496

Table 1: Results for the SD task.

Figure 5: Visualization of the categories learned by the model: rule activations for entity-quote category pairs (E_i, Q_j) under SATIRE and REAL predictions (top, in red), quote topic activations (bottom left, in blue), and entity topic activations (bottom right, in green). Color coding captures the activation strength of manually constructed topical word groups according to each latent category; darker colors indicate higher values. E_i (Q_i) indicates an entity (quote) variable assigned the i-th category.

Data                   REAL   SATIRE
TRAIN - SD'09          2505   133
TEST - SD'09           1495   100
TEST - SD'16           3117   433
TRAIN - SD'09 - DIST   1160   112
TEST - SD'09 - DIST    680    85
TEST - SD'16 - DIST    1964   362

Table 2: Dataset statistics.

(a) Prediction Rules
These patterns are expressed as rules, mapping category assignments to output values. In the DIST task, we consider combinations of entity and quote category pairs, denoted $E_i, Q_j$, in the heat map. The top part of Figure 5, in red, shows the activation strength of each of the category combinations when making predictions over the training data.
Darker colors correspond to larger values, which were computed as:

$$\text{cell}(C_E, C_Q, B) = \frac{\sum_j h^B_{(h_{C_E,j,E}),(h_{C_Q,j,Q})}}{\sum_{j,k,l} h^B_{(h_{k,j,E}),(h_{l,j,Q})}}$$

Intuitively, each cell value in Figure 5 is the number of times the corresponding category pattern appeared in REAL or SATIRE output predictions, normalized by the overall number of pattern activations for each output. We assume that different patterns will be associated with satirical and real articles, and indeed we can see that most entities and quotes appearing in REAL articles fall into a distinctive category pattern, $E_0, Q_0$. Interestingly, there is some overlap between the two predictions in the most active SATIRE category ($E_1, Q_0$). We hypothesize that this is due to the fact that the two article types have some overlap.

(b) Associating topic words with learned categories
In order to understand the entity and quote categories emerging from the training phase, we look at the activation strength of each category pattern with respect to a set of topic words. We manually identified a set of entity types and quote topics which are likely to appear in political articles, and we associate a list of words with each one of these types.

Task: DIST     INDOMAIN (DIST'09+DIST'09)   OUTDOMAIN (DIST'09+DIST'16)
               P      R      F               P      R      F
ALLPOS         0.110  1      0.198           0.155  1      0.268
LEX_U          0.837  0.423  0.561           0.407  0.328  0.363
COMSENSE_Q     0.712  0.553  0.622           0.404  0.561  0.469

Table 3: Results for the DIST task.

For example, the entity topic PRESIDENT was associated with words such as president, vice-president, Obama, Biden, Bush, Clinton. Similarly, we associated a list of profanity words with the quote topic PROFANITY. We associate seven types with quote categories, corresponding to style and topic, namely PROFANITY, DRUGS, POLITENESS, SCIENCE, LEGAL, POLITICS and CONTROVERSY, and another set of seven types with entity categories, namely PRESIDENT, LIBERAL, CONSERVATIVE, ANONYMOUS, POLITICS, SPEAKER and LAW ENFORCEMENT.

In the bottom left part of Figure 5 (in blue), we show the activation strength of each category with respect to the set of selected quote topics. Intuitively, we count the number of times the words associated with a given topic appeared in the text span corresponding to a category assignment pair, separately for each output prediction. We normalize this value by the total number of topic word occurrences over all category assignment pairs. Note that we only look at the text spans corresponding to quote vertices in the NRG. We provide a similar analysis for entity categories in the bottom right part of Figure 5 (in green), showing the activation strength of each category with respect to the set of selected entity topic words.

As can be expected, we see that profanity words are only associated with satirical categories and, even more interestingly, that when words appear in both satirical and real predictions, they tend to fall into different categories. For example, topic words related to DRUGS can appear in real articles discussing alcohol and drug policies, but they also appear in satirical articles portraying politicians using these substances. While these are only qualitative results, we believe they provide strong intuitions for future work, especially considering the fact that the activation values do not rely on direct supervision, and only reflect the common-sense patterns emerging from the learned model.

6 Summary and Future Work

In this paper we presented a latent variable model for satire detection.
We followed the observation that satire detection is inherently a semantic task and modeled the common-sense inferences required for it using a latent variable framework.

We designed our experiments specifically to examine whether our model can generalize better than unstructured lexical models, by testing it on out-of-domain data. Our experiments show that in these challenging settings the performance gap between our approach and the unstructured models increases, demonstrating the effectiveness of our approach.

In this paper we restricted ourselves to a limited narrative representation. In the future we intend to study how to extend this representation to capture more nuanced information.

Learning common-sense representations for prediction problems has considerable potential for NLP applications. As the NLP community considers increasingly challenging tasks focusing on semantic and pragmatic aspects, the importance of finding such common-sense representations will increase. In this paper we demonstrated the potential of common-sense representations for one application. We hope these results will serve as a starting point for other studies in this direction.

References

Gabor Angeli and Christopher D. Manning. 2014. NaturalLI: Natural logic inference for common sense reasoning. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP).

Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Abby Vander Linden, Brittany Harding, Brad Huang, Peter Clark, and Christopher D. Manning. 2014. Modeling biological processes for reading comprehension. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP).

Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. 2011. Learning structured embeddings of knowledge bases. In Proc. of the National Conference on Artificial Intelligence (AAAI).

Clint Burfoot and Timothy Baldwin. 2009. Automatic satire detection: Are you having a laugh? In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In Proc. of the Annual Conference on Computational Natural Language Learning (CoNLL).

David K. Elson. 2012. DramaBank: Annotating agency in narrative discourse. In Proc. of the International Conference on Language Resources and Evaluation (LREC).

P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. 2009. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 99(1).

Elena Filatova. 2012. Irony and sarcasm: Corpus generation and analysis using crowdsourcing. In Proc. of the International Conference on Language Resources and Evaluation (LREC).

Matt Gerber, Andrew S. Gordon, and Kenji Sagae. 2010. Open-domain commonsense reasoning using discourse relations from a corpus of weblog stories. In Proc. of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading.

Roberto González-Ibánez, Smaranda Muresan, and Nina Wacholder. 2011.
Identifying sarcasm in Twitter: a closer look. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Andrew S. Gordon, Cosmin Adrian Bejan, and Kenji Sagae. 2011. Commonsense causal reasoning using millions of personal stories. In Proc. of the National Conference on Artificial Intelligence (AAAI).

Andrew S. Gordon, Zornitsa Kozareva, and Melissa Roemmele. 2012. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Proc. of the Sixth International Workshop on Semantic Evaluation.

Amit Goyal, Ellen Riloff, and Hal Daumé III. 2010. Automatically producing plot unit representations for narrative text. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP).

Jerry R. Hobbs, Mark Stickel, Paul Martin, and Douglas Edwards. 1988. Interpretation as abduction. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Jihen Karoui, Farah Benamara, Véronique Moriceau, Nathalie Aussenac-Gilles, and Lamia Hadrich Belguith. 2015. Towards a contextual pragmatic model to detect irony in tweets. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP).

Igor Labutov and Hod Lipson. 2012. Humor as circuits in semantic networks. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Wendy G. Lehnert. 1981. Plot units and narrative summarization. Cognitive Science, 5(4):293–331.

Douglas B. Lenat. 1995. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38.

Hugo Liu and Push Singh. 2004. ConceptNet: a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226.

Peter LoBue and Alexander Yates. 2011. Types of common-sense knowledge needed for recognizing textual entailment. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Stephanie Lukin and Marilyn Walker. 2013. Really? Well. Apparently bootstrapping improves the performance of sarcasm and nastiness classifiers for online dialogue. In Proc. of the Workshop on Language Analysis in Social Media.

Inderjeet Mani. 2012. Computational Modeling of Narrative. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

André F. T. Martins, Noah A. Smith, and Eric P. Xing. 2009. Polyhedral outer approximations with application to natural language parsing. In Proc. of the International Conference on Machine Learning (ICML).

J. McCarthy. 1980. Circumscription: a form of non-monotonic reasoning. Artificial Intelligence, 13(1-2).

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In The Conference on Advances in Neural Information Processing Systems (NIPS).

Tim O'Keefe, Silvia Pareti, James R. Curran, Irena Koprinska, and Matthew Honnibal. 2012. A sequence labelling approach to quote attribution. In Proc.
of the Conference on Empirical Methods for Natural Language Processing (EMNLP).

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

V. Punyakanok, D. Roth, and W. Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34(2).

Nathan D. Ratliff, J. Andrew Bagnell, and Martin Zinkevich. 2007. (Approximate) subgradient methods for structured prediction. In Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS).

Raymond Reiter. 1980. A logic for default reasoning. Artificial Intelligence, 13(1):81–132.

Antonio Reyes, Paolo Rosso, and Tony Veale. 2013. A multidimensional approach for detecting irony in Twitter. Language Resources and Evaluation, 47(1):239–268.

Matthew Richardson, Christopher J. C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP).

Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and Ruihong Huang. 2013. Sarcasm as contrast between a positive sentiment and negative situation. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP).

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In The Conference on Advances in Neural Information Processing Systems (NIPS).

Xu Sun, Takuya Matsuzaki, Daisuke Okanohara, and Junichi Tsujii. 2009. Latent variable perceptron algorithm for structured classification. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI).

Niket Tandon, Gerard De Melo, and Gerhard Weikum. 2011. Deriving a web-scale common sense fact database. In Proc. of the National Conference on Artificial Intelligence (AAAI).

Joseph Tepperman, David R. Traum, and Shrikanth Narayanan. 2006. "Yeah right": sarcasm recognition for spoken dialogue systems. In Proc. of Interspeech.

Byron C. Wallace, Do Kook Choe, Laura Kertz, and Eugene Charniak. 2014. Humans require context to infer ironic intent (so computers probably do, too). In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Byron C. Wallace, Do Kook Choe, and Eugene Charniak. 2015. Sparse, contextually informed models for irony detection: Exploiting user communities, entities and sentiment. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).

Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2015. Machine comprehension with syntax, frames, and semantics. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).