Conversation Modeling on Reddit Using a Graph-Structured LSTM

Victoria Zayats
Electrical Engineering Department

University of Washington
vzayats@uw.edu

Mari Ostendorf
Electrical Engineering Department

University of Washington
ostendor@uw.edu

Abstract

This paper presents a novel approach for mod-
eling threaded discussions on social media
using a graph-structured bidirectional LSTM
(long-short term memory) which represents
both hierarchical and temporal conversation
structure. In experiments with a task of
predicting popularity of comments in Reddit
discussions, the proposed model outperforms
a node-independent architecture for different
sets of input features. Analyses show a bene-
fit to the model over the full course of the dis-
cussion, improving detection in both early and
late stages. Further, the use of language cues
with the bidirectional tree state updates helps
with identifying controversial comments.

1 Introduction

Social media provides a convenient and widely used
platform for discussions among users. When the
comment-response links are preserved, those con-
versations can be represented in a tree structure
where comments represent nodes, the root is the
original post, and each new reply to a previous com-
ment is added as a child of that comment. Some
examples of popular services with tree-like struc-
tures include Facebook, Reddit, Quora, and Stack-
Exchange. Figure 1 shows an example conversa-
tion on Reddit, where bigger nodes indicate higher
upvoting of a comment.1 In services like Twitter,

1The tool https://whichlight.github.io/
reddit-network-vis was used to obtain this visualiza-
tion.

Figure 1: Visualization of a sample thread on Reddit.

tweets and their retweets can also be viewed as form-
ing a tree structure. When time stamps are avail-
able with a contribution, the nodes of the tree can
be ordered and annotated with that information. The
tree structure is useful for seeing how a discussion
unfolds into different subtopics and showing differ-
ences in the level of activity in different branches of
the discussion.

Predicting popularity of comments in social me-
dia is a task of growing interest. Popularity has
been defined in terms of the volume of the re-
sponse, but when the social media platform has
a mechanism for readers to like or dislike com-
ments (or, upvote/downvote), then the difference in
positive/negative votes provides a more informative
score for popularity prediction. This definition of

121

Transactions of the Association for Computational Linguistics, vol. 6, pp. 121–132, 2018. Action Editor: Ani Nenkova.
Submission batch: 11/2016; Revision batch: 3/2017; Published 2/2018.

c©2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


(a) Forward hierarchical and timing structure (b) Backward hierarchical and timing structure

Figure 2: An example of model propagation in a graph-structured LSTM. Here, the node name are shown in a chrono-
logical order, e.g. comment t1 was made earlier than t2. 2(a) Propagation of graph-structured LSTM in the forward
direction. Blue arrows represent hierarchical propagation, green arrows represent timing propagation.2(b) Backward
hierarchical (blue) and timing (green) propagation of graph-LSTM.

popularity, which has also been called community
endorsement (Fang et al., 2016), is the task of inter-
est in our work on tree-structured modeling of dis-
cussions.

Previous studies found that the time when the
comment/post was published has a big impact on
its popularity (Lakkaraju et al., 2013). In addition,
the number of immediate responses can be predic-
tive of the popularity, but some comments with a
high number of replies can be either controversial
or have a highly negative score. Language should be
extremely important for distinguishing these cases.
Indeed, community style matching is shown to be
correlated to comment popularity in Reddit (Tran
and Ostendorf, 2016). However, learning useful lan-
guage cues can be difficult due to the low frequency
of these events and the dominance of time, topic
and other factors. Thus, in several prior studies, au-
thors constrained the problem to reduce the effect
of those factors (Lakkaraju et al., 2013; Tan et al.,
2014; Jaech et al., 2015). In this study, we have no
such constraints, but attempt to use the tree structure
to capture the flow of information in order to better
model the context in which a comment is submitted,
including both the history it responds to as well as
the subsequent response to that comment.

To capture discussion dynamics, we introduce a
novel approach to modeling the discussion using
a bidirectional graph-structured LSTM, where each
comment in the tree corresponds to a single LSTM

unit. In one direction, we capture the prior his-
tory of contributions leading up to a node, and in
the other, we characterize the response to that com-
ment. Motivated by prior findings that both response
structure and timing are important in predicting pop-
ularity (Fang et al., 2016), the LSTM units include
both hierachical and temporal components to the up-
date, which distinguishes this work from prior tree-
structured LSTM models. We assess the utility of
the model in experiments on popularity prediction
with Reddit discussions, comparing to a neural net-
work baseline that treats comments independently
but leverages information about the graph context
and timing of the comment. We analyze the results
to show that the graph LSTM provides a useful sum-
mary representation of the language context of the
comment.

As in Fang et al. (2016), but unlike other work (He
et al., 2016), our model makes use of the full discus-
sion thread in predicting popularity. While knowl-
edge of the full discussion is only useful for post-
hoc analysis of past discussions, it is reasonable to
consider initial responses to a comment, particularly
given that many responses occur within minutes of
someone posting a comment. Comments are often
popular because of witty analogies made, which re-
quires knowledge of the world beyond what is cap-
tured in current models. Responses to these com-
ments, as well as to controversial comments, can
improve popularity prediction. Responses of others

122


clearly influence the likelihood of someone to like or
dislike a comment, but also whether they even read a
comment. By introducing a forward-backward tree-
structured model, we provide a mechanism for lever-
aging early responses in predicting popularity, as
well as a framework for better understanding the rel-
ative importance of these responses.

The main contributions of this paper include: a
novel approach for representing tree-structured lan-
guage processes (e.g., social media discussions)
with LSTMs; evaluation of the model on the pop-
ularity prediction task using Reddit discussions; and
analysis of the performance gains, particularly with
respect to the role of language context.

2 Method

The proposed model is a bidirectional graph LSTM
that characterizes a full threaded discussion, assum-
ing a tree-structured response network and account-
ing for the relative order of the comments. Each
comment in a conversation corresponds to a node in
the tree, where its parent is the comment that it is re-
sponding to and its children are the responding com-
ments that it spurs ordered in time. Each node in the
tree is represented with a single recurrent neural net-
work (RNN) unit that outputs a vector (embedding)
that characterizes the interim state of the discussion,
analogous to the vector output of an RNN unit which
characterizes the word history in a sentence. In the
forward direction, the state vector can be thought of
as a summary of the discussion pursued in a partic-
ular branch of the tree, while in the backward di-
rection the state vector summarizes the full response
subtree that followed a particular comment. The
state vectors for the forward and backward direc-
tions are concatenated for the purpose of predicting
comment karma. The RNN updates – both forward
and backward – incorporate both temporal and hier-
archical (tree-structured) dependencies, since com-
menters typically consider what has already been
said in response to a parent comment. Hence, we
refer to it as a graph-structured RNN rather than a
tree-structured RNN. Figures 2(a) and 2(b) show an
example of the state connections associated with hi-
erarchical and timing structures for the forward and
backward RNNs, respectively.

The supervision signal in training will impact the

character of the state vector, and the forward and
backward state sub-vectors are likely to capture dif-
ferent phenomena. Here, the objective is to predict
quantized comment karma. We anticipate that the
forward state will capture relevance and informa-
tiveness of the comment, and the backward process
will capture sentiment and richness of the ensuing
discussion.

The specific form of the RNN used in this work is
an LSTM. The detailed implementation of the model
is described in the sections to follow.

2.1 Graph-structured LSTM

Each node in the tree is associated with an LSTM
unit. The input xt is an embedding that can incorpo-
rate both comment text and local submission context
features associated with thread structure and timing,
described further in section 2.2. The node state vec-
tor ht is generated using a modification of the stan-
dard LSTM equations to include both hierarchical
and timing structures for each comment. Specifi-
cally, we use two forget gates - one for the previous
(or subsequent) hierarchical layer, and one for the
previous (or subsequent) timing layer.

In order to describe the update equations, we
introduce notation for the hierarchical and timing
structure. In Figure 2, the nodes in the tree are num-
bered in the order that the comments are contributed
in time. To characterize graph structure, let π(t) de-
note the parent of t and κ(t) its first child. Time
structure is represented only among a set of siblings:
p(t) is the sibling predecessor in time, and s(t) is the
sibling successor. The pointers κ(t), p(t) and s(t)
are set to ∅ when t has no child, predecessor, or suc-
cessor, respectively. For example, in Figure 2(a), the
node t2 will have π(t2) = t1, κ(t2) = t4, p(t2) = ∅
and s(t2) = t3, and the node t3 will have π(t3) = t1,
κ(t3) = ∅, p(t3) = t2 and s(t3) = t5.

Below we provide the update equations for the
forward process, using the subscripts i, f, g, c,
and o for the input gate, temporal forget gate, hier-
archichal forget gate, cell, and output, respectively.
The vectors it, ft, and gt are the weights for new in-
formation, remembering old information from sib-
lings, and remembering old information from the
parent, respectively. σ is a sigmoid function, and
◦ indicates the Hadamard product. If p(t) = ∅, then

123


hp(t) and cp(t) are set to the initial state value.

it = σ(Wixt + Uihp(t) + Vihπ(t) + bi)

ft = σ(Wfxt + Ufhp(t) + Vfhπ(t) + bf)

gt = σ(Wgxt + Ughp(t) + Vghπ(t) + bg)

c̃t = Wcxt + Uchp(t) + Vchπ(t) + bc

ct = ft ◦ cp(t) + gt ◦ cπ(t) + it ◦ c̃t
ot = σ(Woxt + Uohp(t) + Vohπ(t) + bo)

ht = ot ◦ tanh(ct)

When the whole tree structure is known, we can
take advantage of the full response subtree to bet-
ter represent the node state. To that end, we define
a backward LSTM that has a similar set of update
equations except that only the first child will pass
the hidden state to its parent. Specifically, the update
equations are the same except that π(t) is replaced
with κ(t), p(t) is replaced with s(t), and a different
set of weight matrices and bias vectors are learned.

Let + and − indicate forward and backward em-
beddings respectively. On top of the LSTM unit, the
forward and backward state vectors are concatenated
and passed to a softmax layer to predict 8 quantized
karma levels:

P(yt = j|x,h) =
exp(W

j
s [h

+
t ;h

−
t ])∑8

k=1 exp(W
k
s [h

+
t ;h

−
t ])

where x and h correspond to the set of input features
and state vectors (respectively) for all nodes in the
discussion.

2.2 Input Features

The full model includes two types of features in the
input vector, including non-textual features associ-
ated with the submission context and the textual fea-
tures of the comment at that node.

The submission context features are extracted
from the graph and metadata associated with the
comment, motivated by prior work showing that
context factors such as the forum, timing and au-
thor of a post are very useful in predicting popular-
ity. The submission context features include:

• Timing: time since root, time since parent (in
hours), number of later comments, and number
of previous comments

• Author: a binary indicator as to whether the au-
thor is the original poster, and number of com-
ments made by the author in the conversation

• Graph-location: depth of the comment (dis-
tance from the root), and number of siblings

• Graph-response: number of children (direct
replies to the comment), height of the sub-
tree rooted from the node, size of that subtree,
number of children normalized for each thread
(2 normalization techniques), subtree size nor-
malized for each thread (2 normalization tech-
niques).

Two methods are used to normalize the subtree size
and number of children to compensate for variation
associated with the size of the discussion, specif-
ically: i) subtract the mean feature value in the
thread, and ii) divide by the square root of the rank
of the feature value in the thread.

These features are a superset of those used in Fang
et al. (2016). The subvector including all these fea-
tures is denoted xst .

The comment text features, denoted xct, are gen-
erated using a simple average bag-of-words repre-
sentation learned during the training:

xct =
1

N

N∑

i=1

W ie

where W ie is an embedding of the i-th word in the
comment, and N is the number of words in the
comment. Comments longer than 100 words were
truncated to reduce noise associated with long com-
ments, assuming that the early portion carries the
most information. The percentage of the comments
that exceed 100 words is around 11%−14% for the
subreddits used in the study. In all experiments, the
word embedding dimension is d = 100, and the vo-
cabulary includes only words that occurred at least
10 times in the dataset.

The input vector xt is set to either xst or [x
s
t;x

c
t],

depending on whether the experiment uses text.

2.3 Pruning
Often the number of comments in a single subtree
can be large, which leads to high training costs. A
large percentage of the comments are low karma and

124


minimally relevant for predicting karma of neigh-
bors, and many can be easily identified with simple
graph and timing features (e.g. having no replies or
contributed late in the discussion). Therefore, we
introduce a preprocessing step that identifies com-
ments that are highly likely to be low karma to de-
crease the computation cost. We then assign these
nodes to be level 0 and prune them out of the tree,
but retain a count of nodes pruned for use in a count-
weighted bias term in the update to capture informa-
tion about response volume.

For detecting low karma comments, we train a
simple SVM classifier to identify comments at the
0 karma level based on the submission context fea-
tures. If a pruned comment leads to a disconnected
graph (e.g., an internal node is pruned but not its
children), then the comment is retained in the tree.
In testing, all pruned comments are given a predicted
level of 0 and accounted for in the evaluation.

The state updates have an additional bias term for
any nodes that have subsequent sibling or children
comments pruned. For example, consider Figure 2,
if nodes {t5, t6, t7, t9} are pruned, then t8 will have
a modified forward update, and t3, t4 will have a
modified backwards update. At node t, define Mκt
to be the number of levels pruned below it, Mpt as
the number of immediately preceeding comments
pruned in its subgroup (responding to the same par-
ent), and Mst as the number of subsequent comments
pruned in its subgroup plus the non-initial comments
in the associated subtrees. In the example above,
Mκ3 = 1, M

s
3 = 2, M

s
4 = 1, M

p
8 = 1, and all

other M∗t = 0. The pointers are updated reflect the
structure of the pruned tree, so p(8) = 4, s(4) =
8, s(3) = ∅. The bias vectors rκ, rp and rs are as-
sociated with the different sets of nodes pruned.

Let + and − indicate forward and backward em-
beddings, respectively. The forward update has an
adjusted predecessor contribution (h+

p(t)
+ M

p
t rp).

The backward update adds Mst rs + M
κ
t rκ to either

h−
s(t)

or h−
κ(t)

, depending on whether it is a time or
hierarchical update, respectively.

2.4 Training
The objective function is minimum cross-entropy
over the quantized levels. All model parame-
ters are jointly trained using the adadelta optimiza-
tion algorithm (Zeiler, 2012). Word embeddings

subreddit comments threads vocab size
askwomen 0.8M 3.5K 32K
askmen 1.1M 4.5K 35K
politics 2.2M 4.9K 55K

Table 1: Data statistics.

subreddit Prec Rec % pruned
askwomen 67.9 72.4 36.9
askmen 60.1 75.3 36.1
politics 49.6 60.3 47.5

Table 2: Precision and recall of the pruning classifier and
percentage of comments pruned.

are initialized using word2vec skip-gram embed-
dings (Mikolov et al., 2013) trained on all com-
ments from the corresponding subreddit. The
code is implemented in Theano (Team et al.,
2016) and is available at https://github.
com/vickyzayats/graph-LSTM.We tune the
model over different dimensions of the LSTM unit,
and use the performance on the development set as
a stopping criteria for the training.

3 Experiments

3.1 Data

Reddit2 is a popular discussion forum platform con-
sisting of a large number of subreddits focusing on
different topics and interests. In our study, we exper-
imented with 3 subreddits: askwomen, askmen, and
politics. All the data consists of discussions made in
the period between January 1, 2014 and January 31,
2015. Table 1 shows the total amount of data used
for each of the subreddits. For each subreddit, the
threads were randomly distributed between training,
development (dev) and test sets with the proportions
of 6:2:2. The performance of the pruning classifier
on the dev set is presented in Table 2.

3.2 Task and Evaluation Metrics

Reddit karma has a Zipfian distribution, highly
skewed toward the low-karma comments. Since the
rare high karma comments are of greatest interest in
popularity prediction, Fang et al. (2016) proposes a

2https://reddit.com

125


task of predicting quantized karma (using a nonlin-
ear head-tail break rule for binning) with evaluation
using a macro average of the F1 scores for predict-
ing whether a comment exceeds each different level.
Experiments reported here use this framework.

Specifically, all the comments with karma lower
than 1 are assigned to level 0, and each subsequent
level corresponds to karma less than or equal to the
median karma in the rest of the comments based
on the training data statistics. Each subreddit has 8
quantized karma levels based on its karma distribu-
tion. There are 7 binary subtasks (does the comment
have karma at level j or higher for j = 1, . . . ,7), and
the scoring metric is the macro average of F1(j).
For tuning hyperparameters and as a stopping cri-
terion, we use a linearly weighted average of F1
scores to increase the weight on high karma com-
ments, which gives slightly better performance for
the high karma cases but has only a small effect on
the macro average.

3.3 Baseline and Contrast Systems

We compare the graph LSTM to a node-independent
baseline, which is a feedforward neural network
model consisting of input, hidden and softmax lay-
ers. This model is a simplification of the graph-
LSTM model where there is no connection between
nodes. The node-independent model characterizes a
comment without reference to either the text of the
comment that it is responding to or the comments
reacting to it. However, the model does have in-
formation on the size of the response subtree via
the submission context input features. Both node-
independent and graph-structured models are trained
with the same cost function and tuned over the same
set of hidden layer dimensions.

We contrast performance of both architectures
with and without using the text of the comment it-
self. As shown in Fang et al. (2016), simply us-
ing submission context features (graph, timing, au-
thor) gives a strong baseline. In order to evaluate
the role of each direction (forward or backward)
in the graph-structured model, we also present re-
sults using only the forward direction graph-LSTM
for comparison to the bidirectional model. In addi-
tion, in order to evaluate the importance of the lan-
guage of the comment itself vs. the language used
in the rest of the tree, we perform an interpolation

Model Text askwomen askmen politics
indep no 53.2 48.3 46.6
graph no 54.6 52.1 47.9
indep yes 52.8 50.7 47.4
interp mix 54.7 52.1 48.2

graph(f) yes 55.0 53.3 49.9
graph yes 56.4 54.8 50.4

Table 3: Average F1 score of karma level prediction for
node-independent (indep) vs. graph-structured (graph)
models with and without text features; interp corresponds
to an interpolation of the graph-structured model with-
out text and the node-independent model with text; and
graph(f) corresponds to a graph-structured model con-
tains forward direction only.

between the graph-LSTM with no language features
and the node-independent model with language fea-
tures. The relative weight for the two models is
tuned on the development set.

3.4 Karma Level Prediction

The results for the average F1 scores on the test set
are presented in Table 3. In experiments for all the
subreddits, graph-structured models outperform the
corresponding node-independent models both with
and without language features. Language features
also give a greater performance gain when used in
the graph-LSTM models. The fact that the forward
graph improves over the interpolated models shows
that it is not simply the information in the current
node that matters for karma of that node. Finally,
while the full model outperforms the forward-only
version for all the subreddits, the gain is smaller than
that obtained by the forward direction alone over the
node-independent model, so the forward direction
seems to be more important.

The karma prediction results (F1 score) at the dif-
ferent levels is shown in Figure 3. While in askmen
and askwomen subreddits the overall performance
decreases for higher levels, the politics subreddit has
an opposite trend. This may be due in part to the
lower pruning recall in the politics subreddit, but
Fang et al. (2016) also observe higher performance
for high karma levels in the politics subreddit.

126


Figure 3: F1 scores as a function of the quantized levels for different model configuration.

4 Analysis

Here, we present analyses aimed at better under-
standing the behavior of the graph-structured model
and the role of language in prediction. All analyses
are performed on the development set. The anal-
yses are motivated by considering possible scenar-
ios that are exceptions to the easy cases, which are:
i) comments that are contributed early in the discus-
sion and spawn large subtrees, likely to have high
karma, and ii) comments with small subtrees that
typically have low karma. We hypothesized three
scenarios where the bidirectional graph-LSTM with
text might be useful. One case is controversial com-
ments, which have large subtrees but do not have
high karma because of downvotes; these tend to have
overprediction of karma when using only submis-
sion context. The other two scenarios involve un-
derprediction of karma when using only submission
context. Early comments associated with few chil-
dren and a more narrow subtree (see the downward
chain in Figure 1) may spawn popular new threads
and benefit from the popularity of other comments
in the thread (more readers attracted), thus having

higher popularity than the number of children sug-
gests. Lastly, comments that are clever or humor-
ous discussion endpoints might have high popularity
but small subtrees. These two cases tend to differ in
their relative timing in the discussion.

4.1 Karma Prediction vs. Time

The first study looked at where the graph-LSTM
provides benefits in terms of timing. We plot the
average F1 score as a function of the contribution
time in Figure 4. As an approximation for time,
we use the quantized number of comments made
prior to the current comment. The plots show that
the graph-structured model improves over the node-
independent model throughout the discussion. Rel-
ative gains are larger towards the end of discussions
where the node-independent performance is lower.
A similar trend is observed when plotting average
F1 as a function of depth in the discussion tree.

While the use of text in the graph-LSTM seems
to help throughout the discussion, we hypothesized
that there would be different cases where it might
help, and these would occur at different times. In-
deed, 93% of the comments that are overpredicted

127


Figure 4: Average F1 scores as a function of time,
approximated using the number of previous comments
quantized in increments of 20.

by more than 2 levels by the node-independent
model without text (controversial comments) occur
in the first 20% of the discussion. Comments that are
underpredicted by more than 2 occur throughout the
discussion and are roughly uniform (13-19%) over
the first half of the discussion, but then quickly ramp
down. High-karma comments are rare at the end of
the discussion; less than 5% of the underpredicted
comments are in the last 30%.

4.2 Importance of Responses

In order to see how the model benefits from using the
language cues in underpredicted and overpredicted
scenarios, we look at the size of errors made by the
graph-LSTM model with and without text features.
In Figure 5, the x-axis indicates the error between
the actual karma level and the karma level predicted
by the graph-LSTM using submission context fea-
tures only. The negative errors represent the over-
predicted comments, and the positive errors repre-
sent the underpredicted comments. The y-axis rep-
resents the average error between the actual karma
level and the karma level predicted by the model
using both submission context and language fea-
tures.The x=y identity line corresponds to no benefit
from language features. Results are presented for

the politics subreddit; other subreddits have similar
trends but smaller differences for the underpredicted
cases.

We compare two models – bidirectional and for-
ward direction graph-structured LSTM – in order to
understand the role of the language of the replies vs.
the comment and its history. We find that, for the
bidirectional graph-LSTM model, language is help-
ing identify overpredicted cases more than underpre-
dicted ones. The forward direction model also out-
performs the node-independent model, but has less
benefit in overpredicted cases, consistent with our
intuition that controversy is identifiable based on the
responses. Although the comment text input is sim-
ply a bag of words, it can capture the mixed senti-
ment of the responses.

While it is not represented in the plot, larger er-
rors are much less frequent. Looking at average
F1 as a function of the number of children (direct
responses), we found that the graph-LSTM mainly
benefits nodes that have a small number of children,
consistent with the two underprediction scenarios
hypothesized. However, many underpredicted cases
are not impacted, since errors due to pruning con-
tribute to 15-40% of the underpredicted cases, de-
pending on the subreddit (highest for politics). This
explains the smaller gains for the positive side of
Figure 5.

4.3 Language Use Analysis
To provide insights into what the model is learning
about language, we looked at individual words asso-
ciated with different categories of comments, as well
as examples of the different error cases.

For the word level analysis, we classified words
in two different ways, again using the politics sub-
reddit. First, we associate words in comments with
zero or positive karma. For each word in the vocab-
ulary, we calculate the probability of a single-word
comment being level zero using the trained model
with a simplified graph structure (a post and a com-
ment) where all the inputs were set to zero except the
comment text. The lists of positive-karma and zero-
karma correspond to the 300 words associated with
the lowest and highest probability of zero-karma, re-
spectively. We identified 300 positive-karma and
zero-karma reply words in a similar fashion, using a
simplified graph with individual words us as inputs

128


Figure 5: The error between the actual karma level and the karma level predicted by the model using both submission
context and language features. Negative errors correspond to over-prediction; positive errors correspond to under-
prediction.

for the reply while predicting the comment karma.
Second, we identified words that may be indica-

tive of comments that are over- and underpredicted
by the graph-structured model without text and for
which the graph-LSTM model with text reduced the
error by more than 2 levels. Specifically, we choose
those words w in comments having the highest ratio
r = p(w|t)/p(w), where t indicates an over- or un-
derpredicted comment, subject to minimum occur-
rence constraints (5 for overpredicted comments, 15
for underpredicted comments). The 50 words with
the highest ratio were chosen for each case and any
words in both over- and underpredicted sets were
eliminated, leaving 47 words. Again, this was re-
peated for words in replies to over vs. underpre-
dicted comments, but with a minimum count thresh-
old of 20, resulting in 45 words.

The lists are noisy, similar to what is often found
with the topic model, and colored by the language
of the subreddit community, but a few trends can be
observed. Looking at the list of words associated
with replies to positive-karma comments we noticed
words that indicate humor (“LOL”, “hilarious”),
positive feedback (“Like”, “Right”), and emotion in-
dicators (“!!”, swearing). Words in comments and
replies associated with overpredicted (controversial)
cases are related to controversial topics (sexual, reg-
ulate, liberals), named political parties, and men-
tions of downvoting or indication that the comment
has been edited with the word “Edit.”

Since the two sets of lists were generated sepa-
rately, there are words in the over/under-predicted
lists that overlap with the zero/non-zero karma lists
(12 in the reply lists, 20 in the comment lists). The
majority of the overlap (26/32 words) is consistent

Figure 6: The mapping of the words in the comments
to the shared space using t-SNE in politics subred-
dit. Shown are the words that are highly associated
with positive-karma, negative-karma, underpredicted and
overpredicted comments.

with the intuition that words on the underpredicted
list should be associated with positive-karma, and
words on the overpredicted list might overlap with
the zero-karma list.

Rather than providing word lists, many neural net-
work studies illustrate trends using word embed-
ding visualization. The embeddings of the words
from the union of lists for positive-karma, zero-
karma, underpredicted and overpredicted comments
and replies were together used to learn a t-SNE map-
ping. The results are plotted for comments in Fig-
ure 6, which shows that the words that are as-
sociated with underpredicted comments (red) are
aligned with positive-karma words (green) for both
comment text and text in replies. Words associated

129


with overpredicted comments (blue) are more scat-
tered, but they are somewhat more like the zero-
karma words (yellow). The trends for words in
replies are similar.

Table 4 lists examples of the different error sce-
narios with the reference karma and predictions
of different models (node-independent without text,
feedforward graph-LSTM with text, and the full
biLSTM). The first two examples are overpredicted
(controversial) cases, where ignoring text leads to a
high karma prediction, but the reference is zero. In
the first case, the forward model incorrectly predicts
high karma because “Republican” tends to be asso-
ciated with positive karma. The model leveraging
reply text correctly predicts the low karma. In the
second case, the forward model captures reduces the
prediction, but again having the replies is more help-
ful. The next two cases are examples of underpredic-
tion due to small subtrees. Example 3 is incorrectly
labeled as level 0 by the forward and no-text models,
but because the responses mention “nice joke” and
“accurate analogy,” the bidirectional model is able
to identify it as level 7. Example 4 has only one
child, but both models using language correctly pre-
dict level 7, probably because the model has learned
that references to “Colbert” are popular. The next
two examples are underpredicted cases from early
in the discussion, many of which expressed an opin-
ion that in some way provided multiple perspectives.
Finally, the last two examples represent instances
where neither model successfully identifies a high
karma comment, which often involve analogies. Un-
like the “titanic” analogy, these did not have suffi-
cient cues in the replies.

5 Related Work

The problem of predicting popularity in social me-
dia platforms has been the subject of several studies.
Popularity as defined in terms of volume of response
has been explored for shares on Facebook (Cheng
et al., 2014) and Twitter (Bandari et al., 2012) and
Twitter retweets (Tan et al., 2014; Zhao et al., 2015;
Bi and Cho, 2016). Studies on Reddit predict karma
as popularity (Lakkaraju et al., 2013; Jaech et al.,
2015; He et al., 2016) or as community endorsement
(Fang et al., 2016). Popularity prediction is a diffi-
cult task where many factors can play a role, which

is why most prior studies control for specific factors,
including topic (Tan et al., 2014; Weninger et al.,
2013), timing (Tan et al., 2014; Jaech et al., 2015),
and/or comment content (Lakkaraju et al., 2013).
Controlling for specific factors is useful in under-
standing the components of a successful post, but it
does not reflect a realistic scenario. Studies that do
not include such constraints have looked at Twitter
retweets (Bi and Cho, 2016) and Reddit karma (He
et al., 2016; Fang et al., 2016).

The work in (He et al., 2016) uses reinforcement
learning to identify popular threads to track given
the past comment history, so it is learning language
cues relevant to high karma but it does not explicitly
predict karma. In addition, it models relevance via
an inner-product of past and new comment embed-
dings, and uses an LSTM to model inter-comment
dependencies among a collection of comments irre-
spective of their sibling-parent relationship, whereas
the LSTM in our work is over a graph that accounts
for this relationship.

The work most closely related to our study is Fang
et al. (2016). The node-independent baseline im-
plemented in our study is equivalent to their feed-
forward network baseline, but the results are not di-
rectly comparable because of differences in training
(we use more data) and input features. The most im-
portant difference in our approach is the representa-
tion of textual context using a bidirectional graph-
LSTM, including the history behind and responses
to a comment. Other differences are: i) Fang et
al. (2016) use an LSTM to characterize comments,
while our model uses a simple bag-of-words ap-
proach, and ii) they learn latent submission context
models to determine the relative importance of tex-
tual cues, while our approach uses a submission con-
text SVM to prune low karma comments (ignoring
their text). Allowing for differences in baselines, we
note that the absolute gain in performance from us-
ing text features is larger for our model, which rep-
resents language context.

Tree LSTMs are a modification of sequential
LSTMs that have been proposed for a variety of
sentence-level NLP tasks (Tai et al., 2015; Zhu et al.,
2015; Zhang et al., 2016; Le and Zuidema, 2015).
The architecture of tree LSTMs varies depending on
the task. Some options include summarizing over
the children, adding a separate forget gate for each

130


Ex karma Comment
1 0 7 7 0 Republicans are fundamentally dishonest. (politics, id:1x9pcx)
2 0 7 4 0 That is rape. She was drunk and could not consent. Period. Any of the supposed evidence

otherwise is nothing but victim blaming. (askwomen, id:2h8pyh)
3 7 0 0 7 The liberals keep saying the titanic is sinking but my side is 500 feet in the air. (politics,

id:1upfgl)
4 7 3 7 7 I miss your show, Stephen Colbert. (askmen, id:2qmpzm)
5 7 3 7 7 that is terrifying. they were given the orders to bust down the door without notice to the

residents, thereby placing themselves in danger. and ultimately, placing the lives of the
residents in danger (who would be acting out of fear and self-defense) (politics, id:1wzwg6)

6 7 0 5 6 It’s something, and also would change the way that Police unions and State Prosecutors
work. I don’t fundamentally agree with the move, since it still necessitates abuse by the
State, but it’s something. (politics, id:27chxr)

7 6 0 0 0 Chickenhawks always talk a big game as long as someone else is doing the fighting. (poli-
tics, id:1wbgpd)

8 6 0 0 0 [They] use statistics in the same way that a drunk uses lampposts: for support, rather than
illumination. -Andrew Lang. (politics, id:1yc2fj)

Table 4: Example comments and karma level predictions: reference, no text, graph(f), graph.

child (Tai et al., 2015), recurrent propagation among
siblings (Zhang et al., 2016), or use of stack LSTMs
(Dyer et al., 2015). Our work differs from these
studies in two respects: the tree structure here char-
acterizes a discussion rather than a single sentence;
and our architecture incorporates both hierarchical
and temporal recursions in one LSTM unit.

6 Conclusion

In summary, this paper presents a novel approach
for modeling threaded discussions on social media
using a graph-structured bidirectional LSTM which
represents both hierarchical and temporal conversa-
tion structure. The propagation of hidden state in-
formation in the graph provides a mechanism for
representing contextual language, including the his-
tory that a comment is responding to as well as
the ensuing discussion it spawns. Experiments on
Reddit discussions show that the graph-structured
LSTM leads to improved results in predicting com-
ment popularity compared to a node-independent
model. Analyses show that the model benefits pre-
diction over the extent of the discussion, and that
language cues are particularly important for distin-
guishing controversial comments from those that are
very positively received. Responses from even a
small number of comments seem to be useful, so it
is likely that the bidirectional model would still be
useful with a short-time lookahead for early predic-

tion of popularity.

While we evaluate the model on predicting the
popularity of comments in specific forums on Red-
dit, it can be applied to other social media platforms
that maintain a threaded structure or possibly to ci-
tation networks. In addition to popularity predic-
tion, we expect the model would be useful for other
tasks for which the responses to comments are in-
formative, such as detecting topic or opinion shift,
influence or trolls. With the more fine-grained feed-
back increasingly available on social media plat-
forms (e.g. laughter, love, anger, tears), it may be
possible to distinguish different types of popularity
as well as levels, e.g. shared sentiment vs. humor.

In this study, the model uses a simple bag-of-
words representation of the text in a comment; more
sophisticated attention-based models and/or feature
engineering may improve performance. In addition,
performance of the model on underpredicted com-
ments appears to be limited by the pruning mecha-
nism that we introduced. It would be useful to ex-
plore the tradeoffs of reducing the amount of prun-
ing vs. using a more complex classifier for prun-
ing. Finally, it would be useful to evaluate per-
formance using a short window lookahead for re-
sponses, rather than the full discussion tree.

131


Acknowledgments

This paper is based on work supported by the
DARPA DEFT Program. Views expressed are those
of the authors and do not reflect the official policy
or position of the Department of Defense or the U.S.
Government. We thank the reviewers for their help-
ful feedback.

References
Roja Bandari, Sitaram Asur, and Bernardo Huberman.

2012. The pulse of news in social media: Forecast-
ing popularity. In Proc. ICWSM.

Bin Bi and Junghoo Cho. 2016. Modeling a retweet
network via an adaptive bayesian approach. In Proc.
WWW.

Justin Cheng, Lada Adamic, P. Alex Dow, Jon Michael
Kleinberg, and Jure Leskovec. 2014. Can cascades be
predicted? In Proc. WWW.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and
Noah A. Smith. 2015. Recurrent neural network
grammars. In Proc. NAACL.

Hao Fang, Hao Cheng, and Mari Ostendorf. 2016.
Learning latent local conversation modes for predict-
ing community endorsement in online discussions. In
Proc. SocialNLP.

Ji He, Mari Ostendorf, Xiaodong He, Jianshu Chen, Jian-
feng Gao, Lihong Li, and Li Deng. 2016. Deep
reinforcement learning with a combinatorial action
space for predicting popular Reddit threads. In Proc.
EMNLP.

Aaron Jaech, Victoria Zayats, Hao Fang, Mari Ostendorf,
and Hannaneh Hajishirzi. 2015. Talking to the crowd:
What do people react to in online discussions? In
Proc. EMNLP.

Himabindu Lakkaraju, Julian J. McAuley, and Jure
Leskovec. 2013. What’s in a name? Understanding
the interplay between titles, content, and communities
in social media. In Proc. ICWSM.

Phong Le and Willem Zuidema. 2015. Compositional
distributional semantics with long short term memory.
In Proc. *SEM.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey
Dean. 2013. Efficient estimation of word represen-
tations in vector space. In Proc. ICLR.

Kai Sheng Tai, Richard Socher, and Christopher D Man-
ning. 2015. Improved semantic representations from
tree-structured long short-term memory networks. In
Proc. ACL.

Chenhao Tan, Lillian Lee, and Bo Pang. 2014. The ef-
fect of wording on message propagation: Topic- and

author-controlled natural experiments on twitter. In
Proc. ACL.

The Theano Development Team, Rami Al-Rfou, Guil-
laume Alain, Amjad Almahairi, Christof Anger-
mueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric
Bastien, Justin Bayer, Anatoly Belikov, et al. 2016.
Theano: A python framework for fast computa-
tion of mathematical expressions. arXiv preprint
arXiv:1605.02688.

Trang Tran and Mari Ostendorf. 2016. Characterizing
the language of online communities and its relation to
community recognition. In Proc. EMNLP.

Tim Weninger, Xihao Avi Zhu, and Jiawei Han. 2013.
An exploration of discussion threads in social news
sites: A case study of the Reddit community. In Proc.
ASONAM.

Matthew D Zeiler. 2012. Adadelta: an adaptive learning
rate method. arXiv preprint arXiv:1212.5701.

Xingxing Zhang, Liang Lu, and Mirella Lapata. 2016.
Top-down tree long short-term memory networks. In
Proc. NAACL.

Qingyuan Zhao, Murat A Erdogdu, Hera Y He, Anand
Rajaraman, and Jure Leskovec. 2015. Seismic: A
self-exciting point process model for predicting tweet
popularity. In Proc. SIGKDD.

Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2015.
Long short-term memory over recursive structures. In
Proc. ICML.

132