Activism via attention: interpretable spatiotemporal learning to forecast protest activities


Ertugrul et al. EPJ Data Science             (2019) 8:5 
https://doi.org/10.1140/epjds/s13688-019-0183-y

REGULAR ARTICLE Open Access

Ali Mert Ertugrul1,2, Yu-Ru Lin1* , Wen-Ting Chung3, Muheng Yan1 and Ang Li1

*Correspondence: yurulin@pitt.edu
1School of Computing and
Information, University of
Pittsburgh, Pittsburgh, USA
Full list of author information is
available at the end of the article

Abstract
The diffusion of new information and communication technologies—social media in
particular—has played a key role in social and political activism in recent decades. In
this paper, we propose a theory-motivated, spatiotemporal learning approach,
ActAttn, that leverages social movement theories and a deep learning framework to
examine the relationship between protest events and their social and geographical
contexts as reflected in social media discussions. To do so, we introduce a novel
predictive framework that incorporates a new design of attentional networks, and
which effectively learns the spatiotemporal structure of features. Our approach is not
only capable of forecasting the occurrence of future protests, but also provides
theory-relevant interpretations—it allows for interpreting what features, from which
places, contribute significantly to the protest forecasting model, as well as
how they make those contributions. Our experimental results from three movement
events indicate that ActAttn achieves superior forecasting performance, with
interesting comparisons across the three events that provide insights into these
recent movements.

Keywords: Interpretable spatiotemporal learning; Event forecasting; Civil unrest;
Protest activities

1 Introduction
Social movements are among the most complex collective actions. They reflect how
collectivities articulate and press their interests to make significant changes in public
policies and political decisions. Every day, news about social movement activity relevant
to a variety of contested issues is being updated, on topics ranging from civil rights, to
human rights, to gender equality, to gun control and others. Throughout human history,
protests have been a primary means of engaging in social movements, in which collectiv-
ities usually give voice to their grievances and concerns about the rights and well-being of
themselves and others [1]. In recent decades, the diffusion of new information and com-
munication technologies—social media in particular—has reshaped the political activism
of our time. From the Arab Spring, to the Occupy Wall Street movement, to the recent
March for Our Lives gun violence protests, social media has been central in providing mo-
bilizing information, coordinating demonstrations, and creating opportunities for people

© The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, pro-
vided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and
indicate if changes were made.




to exchange opinions [2, 3]. In this work, our focus is whether and how online activities
can forecast offline protests.

We started conceptualizing the prediction problem by considering what motivates people
to protest; knowing the factors that drive people to protest may help to
forecast demonstrations. Literature in social movements and social psychology has pro-
posed theories and offered insights into why people protest [4–6]. For example, one fun-
damental factor of a given movement is its “connectedness,” both in terms of how events
connect with other events of a similar kind, temporally and spatially, and in terms of how
they are embedded in an environment where people share similar sociocultural context.
In other words, social movements are not merely instances of independent collective ac-
tions or protest events, but need to be investigated within their social, temporal and ge-
ographical contexts [1]. Empirically, however, in part due to the lack of proper analytical
tools, studies (including social media studies) often analyze single events or movements
via a case-study approach [7–10], or consider a large number of movement-related events
independently of their relationships in time and space [11, 12].

It is crucial to move beyond single cases or aggregate measures and consider the dy-
namic interactions among the multitude of social, temporal and spatial dimensions. Anal-
yses that are sensitive to spatial and temporal insertion will offer insights into how social
movements were different in nature and in terms of progression. For example, some move-
ments directly spoke to major national issues and garnered mass media coverage instan-
taneously, while others originated locally, relying on the efforts of ordinary advocates and
grassroots activists before receiving media attention. To illustrate such differences, in this
work we consider three recent movements—all of which connect to a similar social issue
but are different in their progression in time and space. These include the Black Lives Mat-
ter (BLM) movement, which originated in the African-American community, and became
nationally recognized during the protests and unrest in Ferguson, in August and Novem-
ber 2014 [13], as well as the marches that occurred following the white supremacist rally
that took place in Charlottesville, in August 2017. The latter received intense media cover-
age immediately following the deadly attack that killed counter-protester Heather Heyer
and President Trump’s controversial statements [14]. As shown in Fig. 1, these different
protest events left heterogeneous activity traces, both online and offline, over time and
across locations, creating significant challenges in analyzing their spatial and temporal
patterns.

Recent works in predictive modeling have shown considerable progress in predicting
and forecasting spatiotemporal events, using machine learning methods such as transfer
learning [15, 16]. However, most of them focus on prediction performance and lack the
capability to facilitate understanding the nuanced spatiotemporal characteristics of social
movement events. The theoretically-relevant questions include: in a movement, what so-
cial and activity features are associated with the subsequent events? To what extent are the
local activities (observed from within a region) predictive of the subsequent events, com-
pared to the global activities (observed outside of a region)? And what places’ activities
would have more far-reaching predictive power, in terms of signaling subsequent events
in other places? None of the existing works have been able to answer these questions. In
this work, we aim to provide a predictive modeling framework that is able to unveil the
different spatiotemporal patterns and to answer these questions.



Figure 1 Spatiotemporal occurrence for different social movements by date (x-labels) and by location
(y-labels). A red circle indicates at least one offline protest event happening on a particular day and state; the
blue shade indicates the volume of tweets posted on the corresponding day/state. Charlottesville
counterprotests exhibited burst patterns, in which most of the activities were sparked by a deadly violent
attack and President Trump’s statements on Aug 12th, 2017. In the first few days following the attack and the
statements, more protest events occurred nationwide and a larger tweet volume was observed. The Ferguson I
protests appeared to have a gradual build-up process, in which the activities were initially local (around
Missouri and a few states) following the shooting of Michael Brown on Aug 9th, 2014 and later received global
attention. A global increase in tweet volume was observed until Aug 20th, 2014. The Ferguson II protests
started on Nov 24th, 2014, with the announcement of the jury decision not to indict the police officer, and
garnered global attention. The tweet volume for each state was greater in the first two days after the jury
decision compared to the other days

Figure 2 Overview of our proposed ActAttn architecture. It incorporates hierarchical attentional networks
where the top level (a) differentiates the intra-region and inter-region importance, and the second level
(b) identifies the hub regions. The temporal dependency of time-varying features in both intra- and
inter-regions are modeled using LSTM (c), with sparse feature learning using Group Lasso (d)

Our proposed work. We propose a theory-motivated, spatiotemporal learning approach
called ActAttn that addresses the aforementioned analysis challenge. Figure 2 gives an
overview of ActAttn. Using social media and protest data, ActAttn seeks to character-
ize the social, spatial, and temporal features in relation to the subsequent protest activities
in a unified and automatic manner. We develop a deep learning architecture that is not



only capable of forecasting the occurrence of future protests, but which also allows for
interpreting what features, from which places, contribute significantly to the protest
forecasting model, as well as how they make those contributions. To accomplish this, we
introduce a two-level attentional network architecture that (a) differentiates the feature
contribution from local (intra-region) and global (inter-region) sources, and (b) identifies the
regions, referred to as the “hubs”, that have a more salient contribution in predicting protest
events globally. We utilize the lexicon approach to extract a range of linguistic features that
allows for making sense of the association between the types of activity traces and future
protests. We further leverage a sparse learning approach, Group Lasso [17], to select the
compact set of features for enhancing the feature interpretability and generalizability.

Contributions. A major strength that differentiates our approach from the prior works
is its interpretability. The interpretable capability comes from our model design, which
has drawn largely upon prior social movement theories and empirical studies [1, 4–6]
regarding what motivates people to protest and what geographical and sociocultural contexts
and conditions may contribute to the inception and development of protests. The model
design can be highlighted in terms of two aspects: (a) the selection of features, and (b) the
differentiation of the predictive power that comes from local spatial patterns (or beyond).

To summarize, our contributions include: (1) A unified, spatiotemporal learning framework:
We propose a novel deep learning architecture, ActAttn, that automatically learns
the relationship between the spatiotemporal activity traces observed from a broader com-
munity and the future protest events. This learning framework allows for principally com-
paring the spatiotemporal patterns from different movement events. (2) Interpretability
in hierarchical attention: We use hierarchical attentional networks, together with Long
Short-Term Memory (LSTM) [18], to model the temporal and spatial dependencies in the
activity traces. The attentional networks allow for interpreting the importance of activities
in different regions (intra- vs. inter-region contribution, and hubs), in terms of forecasting
future events. This is the first model that differentiates the intra- and inter-region con-
tributions in the spatiotemporal event forecasting domain. (3) Interpretability in activity
features: We leverage Group Lasso to select a compact set of linguistic features, which
allows for understanding the type of activity traces that are more reliably associated with
future protests. (4) Extensive experiments on forecasting performance, with in-depth analy-
sis and comparison across three real-world movements: We conduct extensive experiments
on three social movement events: the counterprotests to the Charlottesville rally (August
2017), the first wave of Ferguson protests (August 2014), and the second wave of Ferguson
protests (November 2014). Our results indicate a significant improvement in forecasting
performance in comparison to several baseline and state-of-the-art methods. Moreover,
we present in-depth analysis and comparison across three protest events in terms of their
spatiotemporal characteristics and features. The results offer interesting insights regard-
ing how social media “connectedness”—as operationalized at the level of features (social
embeddedness) and the level of the model (the intra- vs. inter-region contribution)—could
predict offline protest activity. Such analyses cannot be obtained with previous models. Fi-
nally, we have made our code and data available to ensure the reproducibility of our results.

2 Related work
2.1 Theoretical perspectives on antecedents of protest behaviors
Literature in social movements and social psychology offers us insights as to why people
protest. Van Stekelenburg and Klandermans [4, 5] proposed a motivational framework



that incorporates and synthesizes several sociopsychological factors that have been theo-
rized and studied as critical to protests: (1) Identity: individuals’ identification with certain
groups/communities brings about a shared sense of future destiny and social responsi-
bility; (2) grievance: a felt sense of illegitimate inequality; (3) emotion: emotions such as
anger, guilt, fear, shame, and despair that “amplify” the felt grievance to be stronger and
“accelerate” people to act more promptly; (4) social embeddedness: the social contexts one
is exposed to and social networks one is embedded in—e.g., the more people engage them-
selves in the environment in which information about a certain grievance can be found,
the more likely they are to start learning about the inequality and thus may take actions to
protest or call for protests; and (5) efficacy: how one perceives that protests could make a
difference.

In brief, protests are more likely to happen when people have social interactions that
offer more opportunities to learn about a grievance and they resonate emotionally with such
illegitimate inequality, when these people identify themselves as members of the
communities that are affected by or responsible for the inequality, and when they believe
protests could bring about change [4].

The framework links individuals’ psychological experiences, which are situated in certain
types of social interactions, to the collective action and implications they eventually lead
to, and is particularly useful for our quantitative study. We are interested
in Twitter users’ individual tweeting behaviors, and whether the users are immersed in a
kind of social embeddedness in which people who are seeking, sharing, and disseminating
information about protests would come to gather together and linger. Such social embed-
dedness transforms individual grievance and emotion into their collective forms and may
further facilitate the social actions of protests. We incorporate four factors—grievance,
identity, social embeddedness, and emotion—into our model design and leverage the lex-
icon approach to operationalizing these factors (see details discussed in Sect. 3.3).

2.2 Forecasting protests and other events
There have been studies that employ social media data to examine social movements and
unrest. Most of them followed a case-study approach in which descriptive statistics, re-
gression analyses, or qualitative analysis were used for the exploration of movements [8,
9, 11, 19]. For example, Conover et al. [8] examined the temporal evolution of digital com-
munication activity related to the Occupy Wall Street movement using Twitter-centric
features including retweets, mentions, and user engagements. De Choudhury et al. [11]
studied the temporal characteristics of social media participation and their relationship to
offline protests related to the BLM movement. Chung et al. [19] studied online social media
discussions during the 2014 Ferguson protests, and employed a thematic analysis to dif-
ferentiate tweets that engaged critical sensemaking from those solely focused on the event
taking place. While these case studies provide detailed descriptions of the studied events,
the analyses depend on specific questions of interest, and thus the results are sensitive to
the particular data manipulation along the spatial or temporal dimensions.

There have been studies that utilize the spatial, temporal or spatiotemporal dependen-
cies in modeling or predicting the events. Several studies employed logistic regression or
heuristics to forecast/detect events from social media related to anomalies [20, 21], crime
[22] and civil unrest [23, 24]. Cadena et al. [25] proposed an event forecasting model for
civil unrest that uses a notion of activity cascades derived from the Twitter communi-
cation networks. Ning et al. [26] proposed a multiple instance learning based approach



that jointly forecasts protest events and identifies event precursors from news articles.
Ramakrishnan et al. [27] proposed to forecast civil unrest from multiple data sources us-
ing models such as logistic regression with Lasso. Zhao et al. proposed spatiotemporal
event forecasting through an enhanced Hidden Markov Model (HMM) [28] and multi-
task learning [15, 16, 29]. Most of the existing techniques primarily focus on forecasting
performance rather than interpreting spatiotemporal characteristics of social events. In
addition, the potential interactions between temporal and spatial dimensions are often
overlooked.

In terms of analyzing online social media content in the context of social movements,
emotional commitment is the most widely studied factor. For example, De Choudhury et al.
used the LIWC lexicon [30] to extract features that cover aspects of emotional expression,
cognition, perception, social orientation, interpersonal awareness, and psychological
distance [11]. On the other hand, the literature on why people protest (e.g., [4, 5]) has of-
fered theoretical foundations and empirical evidence of what factors may be critical for
protest occurrence and participation. In this work, we examine a set of new features that
can provide theoretically-relevant interpretations about a social movement.

3 Method
3.1 Problem definition
Suppose there are $L$ locations (e.g., cities, states) of interest, and each location $l$ can be
represented by a collection of static and dynamic features. The static features (e.g.,
population, political leaning) are features that remain the same or change slowly over a longer
period of time, and the dynamic features (e.g., the percentage of tweets that express the
“anger” emotion) are updated for each time interval $t$ (e.g., hour, day). Let $S_l$ be the set of
static features of location $l$, and $X_{t,l}$ be the set of dynamic features for location $l$ at time $t$.
We are also given a binary variable $Y_{t^*,l} \in \{0, 1\}$ that indicates the occurrence of a future
protest event for each location $l$ at time $t^*$. The collection of dynamic features from all
locations within an observing time window of size $k$ up to time $t$ can be represented as
$X_{t-k+1:t} = \{X_{t-k+1}, \ldots, X_t\}$, where $X_{t'} = \{X_{t',1}, \ldots, X_{t',L}\}$.

Our goal is to predict the future event occurrence $Y_{t^*,l}$ at a specific location $l$ at a future
time $t^* = t + \tau$, where $\tau$ is called the lead time for forecasting. The forecasting is based
on the static and dynamic features of the location itself, as well as the dynamic features
in the environment (from all other locations). Therefore, the forecasting problem can be
formulated as learning a function $f(S_d, X_{t-k+1:t}) \to Y_{t^*,d}$ that maps the input, the static and
dynamic features, to a protest indicator at the future time $t^*$ for a target location $d$.

To facilitate interpretation of the protest forecasting, we seek to develop a model that
can differentiate the contribution of the features, the locality (local/intra-region features
vs. global/inter-region features), and the overall importance of each location when
contributing to the prediction of other locations. Therefore, we further organize the
dynamic features $X_{t-k+1:t}$ into two sets: the intra-region features, $\{X_{t-k+1,d}, \ldots, X_{t,d}\}$,
represent the sequence of dynamic features for the location $d$, and the inter-region features,
$\{X_{t-k+1,l}, \ldots, X_{t,l}\}$ for $l \in \{1, 2, \ldots, L\}$, contain the sequences of dynamic features for all
locations of interest.
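
The windowing in this definition can be sketched in Python with NumPy. This is a minimal illustration rather than the authors' code: the function names `make_samples` and `split_intra_inter`, the array layout (time, location, feature), and the choice to return all locations' labels at once are assumptions made for the example.

```python
import numpy as np

def make_samples(X, Y, k, tau):
    """Build (window, label) pairs from dynamic features and protest indicators.

    X: array of shape (T, L, F), dynamic features per time step and location.
    Y: array of shape (T, L), binary protest indicators.
    k: size of the observing window; tau: lead time.
    Returns windows of shape (N, k, L, F) and labels of shape (N, L),
    where labels[i] is Y at time t + tau for the window ending at t.
    """
    T = X.shape[0]
    windows, labels = [], []
    # t is the last observed time index; the window covers t-k+1 .. t.
    for t in range(k - 1, T - tau):
        windows.append(X[t - k + 1 : t + 1])
        labels.append(Y[t + tau])
    return np.stack(windows), np.stack(labels)

def split_intra_inter(window, d):
    """Split one window (k, L, F) into intra-region and inter-region inputs."""
    intra = window[:, d, :]   # sequence for target location d, shape (k, F)
    inter = window            # sequences for all locations, shape (k, L, F)
    return intra, inter
```

With `T` observed intervals, this yields `T - tau - k + 1` samples; each sample pairs the last `k` intervals of features with the protest indicator `tau` intervals ahead.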

3.2 Model
As shown in Fig. 2, our proposed architecture involves three primary components: the
temporal component $M^{tem}$, the spatial component $M^{sp}$, and the static features $S_d$. $S_d$
provides location-specific information about the target location $d$. The temporal component
$M^{tem}$ models the contribution of the local dynamic features (intra-region features)
for the target location. The spatial component $M^{sp}$ models the spatiotemporal
contribution of dynamic features for all locations of interest (inter-region features).

The recurrent unit. In both $M^{tem}$ and $M^{sp}$, we use LSTM as a building block in our
model to capture the temporal relationships among the dynamic features. LSTM has been
shown to be effective in capturing potential temporal dependency [31–33], and it
addresses the vanishing and exploding gradient problems of basic recurrent neural networks
(RNNs) by using explicit gating mechanisms (input, output and forget gates) to regulate
the memory updates. We include a single LSTM network to model intra-region dynamics
in $M^{tem}$ (Fig. 2(c)). To capture the spatiotemporal relationship among all locations in $M^{sp}$
(Fig. 2(b)), we include separate temporal components, each of which has the same
structure as $M^{tem}$. Each (inter-region) temporal component is then responsible for modeling
the temporal dynamics of a single location. The LSTM outputs inside $M^{tem}$ and $M^{sp}$ are
$h^{tem}_d$ and $\{h^{sp}_1, h^{sp}_2, \ldots, h^{sp}_L\}$, respectively.
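
For readers unfamiliar with the gating mechanism, one LSTM step can be sketched as follows. This uses the standard LSTM formulation in NumPy and is not the authors' implementation; the weight layout, the function names `lstm_step` and `run_lstm`, and the choice to use the last hidden state as the component output are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell.

    x: input at this time step, shape (F,)
    h_prev, c_prev: previous hidden and cell states, shape (H,)
    W: input weights, shape (4H, F); U: recurrent weights, shape (4H, H)
    b: bias, shape (4H,). Rows are ordered: input, forget, output, candidate.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate memory
    c = f * c_prev + i * g         # gated memory update
    h = o * np.tanh(c)             # hidden state passed to later layers
    return h, c

def run_lstm(X_seq, W, U, b):
    """Run the cell over a sequence (k, F) and return the last hidden state."""
    H = U.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    for x in X_seq:
        h, c = lstm_step(x, h, c, W, U, b)
    return h
```

Under these assumptions, `run_lstm` applied to the intra-region sequence would play the role of the output of the temporal component, and one such call per location would give the inter-region outputs.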

Hierarchical attention mechanism. An attention mechanism has been shown to be ef-
fective in reweighting the internal components in a neural architecture [34, 35]. We de-
sign a hierarchical attention mechanism to differentiate the importance of spatial and
temporal information. First, in $M^{sp}$, we incorporate a spatial attention layer on top of
$\{h^{sp}_1, h^{sp}_2, \ldots, h^{sp}_L\}$ to learn the spatial importance among all locations (Fig. 2(b)). The idea is
that not all the locations contribute equally to the prediction of event occurrence at a tar-
get location, and this attention layer is to reward the locations which contribute the most
to correctly forecasting protest occurrence in the target location. The spatial attention is
given by:

$$\nu^{sp} = \sum_{l} \alpha_l h^{sp}_l, \tag{1}$$

where $\nu^{sp}$ is the spatial attention output that summarizes the aggregate contribution of
all locations, and $\alpha_l$ is the attention weight for the location $l$ to be learned based on a
Softmax function. Second, we introduce a spatiotemporal attention layer to differentiate
local (intra-region) and global (inter-region) feature contributions (Fig. 2(a)). The idea
behind this layer is that, in some cases, the occurrence of protest events may largely depend
on the temporal information within the locations themselves, while in other cases, the
occurrence may depend more on the context of other locations or the global dynamics.
The spatiotemporal attention layer is given by:

$$\nu^{st} = \alpha^{tem} h^{tem}_d + \alpha^{sp} \nu^{sp}, \tag{2}$$

where $\alpha^{tem}$ and $\alpha^{sp}$ are the attention weights corresponding to the outputs of temporal and
spatial components, respectively. They are obtained at the output of the Softmax function.
$\nu^{st}$ is the spatiotemporal vector that aggregates the information learned from temporal and
spatial dimensions. The forecasting of the occurrence of protest events is then given by:

$$\hat{Y}_{t^*,d} = \phi\left(W_c \left[S_d, \nu^{st}\right] + b_c\right), \tag{3}$$

where $S_d$ is the static feature of the target location $d$, and $W_c$ and $b_c$ are the weight matrix
and bias vector to be learned in the concatenation layer, respectively. $\phi$ is the activation



function where we apply the Softmax function in order to obtain posterior probabilities
of occurrence and non-occurrence of the protest event.
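
The data flow through Eqs. (1) to (3) can be sketched in NumPy as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the text does not specify how the attention scores are computed before the Softmax, so the dot products with the scoring vectors `v_sp` and `v_st` are hypothetical, as are the function name `act_attn_forward` and the two-class output shape.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def act_attn_forward(h_tem, h_sp, S_d, v_sp, v_st, W_c, b_c):
    """Forward pass through the two attention levels and the output layer.

    h_tem: intra-region LSTM output, shape (H,)
    h_sp:  inter-region LSTM outputs, shape (L, H)
    S_d:   static features of the target location, shape (P,)
    v_sp, v_st: scoring vectors, shape (H,) (assumed scoring function)
    W_c: shape (2, P + H); b_c: shape (2,)
    """
    # Eq. (1): weight each location's output and sum.
    alpha = softmax(h_sp @ v_sp)                 # shape (L,), sums to 1
    nu_sp = alpha @ h_sp                         # shape (H,)
    # Eq. (2): weight the intra-region and inter-region summaries.
    a_tem, a_sp = softmax(np.array([h_tem @ v_st, nu_sp @ v_st]))
    nu_st = a_tem * h_tem + a_sp * nu_sp         # shape (H,)
    # Eq. (3): concatenate with static features and apply softmax.
    y_hat = softmax(W_c @ np.concatenate([S_d, nu_st]) + b_c)
    return y_hat, alpha, (a_tem, a_sp)
```

The returned `alpha` and `(a_tem, a_sp)` are the quantities one would inspect for interpretation: the former ranks locations (candidate hubs), the latter compares intra-region against inter-region contribution.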

Objective function. We incorporate the Group Lasso regularization into the loss function.
Group Lasso has been shown to be effective in several domains, such as robotic control
[36] and multi-modal context [37] to select informative features. This regularization im-
poses sparsity on a group level, such that all the weights in a group are either simultane-
ously set to 0, or none of them are [17]. The main motivation for employing this regular-
ization is to select informative features in temporal components (Fig. 2(d)) while assigning
the optimal weights of the network at the same time. Therefore, it also enables us to
interpret the model in such a way that redundant information from features is minimized,
which allows for differentiating which features are important for the occurrence of protest
events. The objective function is defined as:

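$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} Y_{ij} \log(p_{ij}) + \lambda_1 \left\| W^{tem} \right\|_{2,1} + \lambda_2 \sum_{l=1}^{L} \left\| W^{sp}_l \right\|_{2,1}, \tag{4}$$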

where the first term is cross entropy loss, $n$ is the number of samples, $m$ is the number of
class labels (event and non-event), and $p_{ij}$ is the probability of the sample $i$ being assigned
to class $j$ by the model. $W^{tem}$ is the input weight matrix in $M^{tem}$, and $W^{sp}_l$ is the input
weight matrix of the (inter-region) temporal component of the $l$th location in $M^{sp}$. Note that the
input weight matrix contains all weights of LSTM except for recurrent and bias weights.
Moreover, $\lambda_1$ and $\lambda_2$ are the regularization factors for $M^{tem}$ and $M^{sp}$, respectively.
Therefore, each component can be regularized by different factors. Group Lasso regularization
can be written as:

$$\| W \|_{2,1} = \sum_{g \in G} \sqrt{|g|} \, \| g \|_2, \tag{5}$$

where $g$ is the vector of outgoing connections (weights) from an input neuron, $G$ denotes a
set of input neurons, and $|g|$ indicates the dimension of $g$. We represent each input neuron
in $M^{tem}$ and in each (inter-region) temporal component of $M^{sp}$ as a separate group so that
$G$ contains vectors of these groups.
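
Eqs. (4) and (5) can be computed as in the following NumPy sketch. It is an illustration under assumptions rather than the authors' code: the row-per-input-neuron layout of the weight matrices, the function names, and the small constant `eps` added inside the logarithm for numerical safety are all choices made for this example.

```python
import numpy as np

def group_lasso(W):
    """Eq. (5): sum over input neurons of sqrt(|g|) * ||g||_2.

    W: input weight matrix of shape (F, D). Row f is assumed to hold the
    outgoing weights of input neuron f, so each row is one group g.
    """
    group_size = W.shape[1]
    return np.sqrt(group_size) * np.linalg.norm(W, axis=1).sum()

def objective(Y, P, W_tem, W_sp_list, lam1, lam2, eps=1e-12):
    """Eq. (4): mean cross entropy plus the two Group Lasso terms.

    Y: one-hot labels, shape (n, m); P: predicted probabilities, shape (n, m).
    W_sp_list: one input weight matrix per inter-region temporal component.
    eps: assumed guard against log(0), not part of Eq. (4).
    """
    cross_entropy = -np.mean(np.sum(Y * np.log(P + eps), axis=1))
    penalty_tem = lam1 * group_lasso(W_tem)
    penalty_sp = lam2 * sum(group_lasso(W) for W in W_sp_list)
    return cross_entropy + penalty_tem + penalty_sp
```

A row of zeros contributes nothing to the penalty, which is why driving a whole group to zero amounts to dropping the corresponding input feature.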

3.3 Features
As mentioned earlier, there are two types of features: static and dynamic.

Static features reflect the political and demographic backgrounds of a location in which
a protest event may take place, including the population of the state to which the location
belongs (given as population), population density, vote to Trump (voting behaviors in 2016
presidential election as an indicator of the degree of conservatism in the location), and
region of the United States (Northeast, Midwest, South and West). These features either
remain unchanged or change slowly over time.

Dynamic features capture social media users’ online activities that may be predictive
of offline protests. Drawing upon the social movement literature [4] (discussed in Sect. 2.1),
we focus on four factors: emotion, identity, grievance, and social embeddedness.

Three dictionaries (LIWC [30], SentiSense [38], and Moral-Laden [39]) are used to cap-
ture the features indicating emotions, grievance, and identity, while additional relevant



features beyond these key factors are also included to test their usability. LIWC and Sen-
tiSense include a range of emotions, either positive or negative; LIWC offers the categories
of social and personal pronouns that may serve as indicators of identity. The Moral-Laden
dictionary is used with an attempt to capture grievance that results from the appraisal
of relative deprivation based on moral rules; the dictionary is derived from moral foun-
dation theory which suggests that humans engage in moral judgments along at least five
dimensions: Harm/Care, Cheating/Fairness, Betrayal/Loyalty, Subversion/Authority, and
Degradation/Purity.

Furthermore, in order to operationalize the type and level of social embeddedness, we
capture social media users’ engagement in online discussion, including the number of tweets,
number of reply tweets, and number of tweets with URL links. Greater volumes of any of
these tweeting behaviors (tweets, replies, and URLs) suggest that the public may be more
aware of focal issues and events, and in turn be more motivated in seeking, spreading, and
exchanging information, ideas, and emotions in cyberspaces. Such social contexts may
raise individuals’ perception of the efficacy of protests, which could lead to actual protest
actions. More replies and URL links suggest being more embedded in relevant social net-
works. Replies suggest direct interactions with other embedded users. URL links, on the
other hand, suggest information networks built based on relevant information/content
created by others, including internal links with other tweets, and external links such as
news, blogs, etc. The complete list of features and detailed interpretation are provided in
Fig. 6(a), Fig. 6(b), and Sect. 5.2.
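
As a rough illustration of how such dynamic features could be computed for one location and one time interval, consider the Python sketch below. It is not the authors' pipeline: the tweet record layout, the function name `dynamic_features`, and the toy word sets are hypothetical, and the actual dictionaries (LIWC, SentiSense, Moral-Laden) have their own formats and matching rules that are not reproduced here.

```python
import re

def dynamic_features(tweets, lexicon):
    """Compute per-interval features for one location.

    tweets: list of dicts with keys 'text' (str) and 'is_reply' (bool).
    lexicon: dict mapping a category name to a set of lowercase words
             (a toy stand-in for a real dictionary).
    Returns counts of tweets, replies, and tweets with URLs, plus the
    percentage of tweets containing at least one word of each category.
    """
    n = len(tweets)
    feats = {
        "n_tweets": n,
        "n_replies": sum(1 for t in tweets if t["is_reply"]),
        "n_urls": sum(1 for t in tweets if re.search(r"https?://", t["text"])),
    }
    for category, words in lexicon.items():
        hits = 0
        for t in tweets:
            # Whole-word matching on lowercase tokens.
            tokens = set(re.findall(r"[a-z']+", t["text"].lower()))
            if tokens & words:
                hits += 1
        feats["pct_" + category] = 100.0 * hits / n if n else 0.0
    return feats
```

Stacking these feature dictionaries over intervals and locations would give the time-by-location feature array that the forecasting model consumes.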

4 Experiments
4.1 Dataset
We choose social movements with social significance in order to test the design of our
model with respect to the distinct social, temporal, and spatial dimensions of the nature of
protests. Moreover, we choose movements in which the nature of the issues was relatively
similar in order to compare and contrast the performance of the theory-driven features.
Eventually, we select two movements: Black Lives Matter (BLM) and the counter-protests
to Charlottesville’s white supremacist rally. For BLM, we selected the two separate waves
of protests regarding the police’s killing of Michael Brown in Ferguson. The Ferguson un-
rests were symbolic protests under the umbrella of BLM in opposition to systemic racism
against black people in the US. The Charlottesville counter-protests were the largest re-
cent nationwide protest activities against white supremacism in the US.

Twitter data. We collected tweets containing specific keywords or hashtags for three events:
the counter-protests to the Charlottesville rally [14], and the first and second waves of the
Ferguson protests [13]. The size and statistics of each dataset are provided in Table 1. The
Charlottesville Dataset was collected through the Streaming API based on 17 keywords and/or hashtags
of interest.a Retweets were not included. These keywords were emerging during the event
and were then widely used on Twitter to refer to the relevant issues and happenings. The
Ferguson I Dataset and Ferguson II Dataset were collected based on the published work
[40], using 45 keywords including #ferguson, #blacklivesmatter, “black lives matter” and
the names of black people killed by police during 2014 and 2015. Based on the tweet IDs
provided in the published dataset, we recollected the tweets within the two periods and
excluded the retweets.



Table 1 Basic statistics of the datasets

Dataset          Duration                #Tweets   #Users   #Protest occurrences
Charlottesville  Aug 11–Aug 31 (2017)    11.36M    5.93M    136
Ferguson I       Aug 9–Aug 27 (2014)     8.02M     2.76M    90
Ferguson II      Nov 21–Dec 10 (2014)    9.86M     3.80M    104

Protest data. We collected ground-truth data from the website of Elephrame (see notes b and c) on the
occurrence of offline protest events during the periods of the Charlottesville counter-
protests and the two waves of the Ferguson protests. Elephrame provides information
about civil unrest events which occurred in the US. This information is kept in a struc-
tured way and includes protest occurrence time (start date and end date), protest location
(in state-level and city-level), protest subjects (sub-type of the protest event), description,
number of participants, and at least one source link. We also incorporate news reports
about BLM protests that were collected by the authors of [11]. Each piece of protest event
information is based on the given source link(s). Note that there can be more than one
event in the same location at the same time interval. In this work, we only consider whether
an event occurred in a given location at that time interval, and we represent the occurrence
using binary variables. As a result, we observed 136, 90 and 104 offline protest events dur-
ing the three movements across the country.
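The binary encoding described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `occurrence_matrix` and the example records are hypothetical, and the only behavior taken from the text is that several events in the same state on the same day collapse into a single occurrence.

```python
from datetime import date

def occurrence_matrix(events, states, days):
    """Return {(state, day): 0 or 1}; 1 if at least one event is recorded there."""
    occurred = {(s, d): 0 for s in states for d in days}
    for state, day in events:
        if (state, day) in occurred:
            occurred[(state, day)] = 1  # duplicate events collapse to a single 1
    return occurred

# Hypothetical records: (state code, date). Two events share VA on Aug 12.
events = [("VA", date(2017, 8, 12)), ("VA", date(2017, 8, 12)), ("CA", date(2017, 8, 13))]
states = ["CA", "TX", "VA"]
days = [date(2017, 8, 12), date(2017, 8, 13)]
y = occurrence_matrix(events, states, days)
```

Here the two VA records on the same day count once, so `y` contains two positive entries out of six state-day pairs.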

Location extraction. In this work, we seek to forecast the occurrence of offline protest
events at the state level, using Twitter users’ activities. The locations of tweets are either
extracted from their geocodes (if available) or inferred from the users’ profiles. First, the
geotagged tweets posted from the United States include state information in their ‘place’
field. These kinds of posts include either a state name or state code. We directly use this in-
formation as the location indicator. Second, we find the location information of the tweets
from user profiles. We follow this approach for the tweets whose locations cannot be iden-
tified using the first approach. Similar to the first approach, we identify the locations (state
name or state code) if they are explicitly written in the user profiles. If they are not, we also
look for the names of cities located in the United States. If we identify a city name in the
profile, we map it to its corresponding state. For this purpose, we use a dictionary of
city-state pairs in the United States from Encyclopedia Britannica (see note d). Note that there
can be more than one city with the same name in different states. Therefore, we discard
such cities in this study. In total, we were able to extract tweet locations at the state level
for 29.9%, 41.5% and 43.3% of all tweets in the Charlottesville, Ferguson I, and Ferguson II
datasets, respectively.
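The two-step location lookup can be sketched as below. This is an assumption-laden illustration rather than the authors' implementation: the lookup tables are tiny hypothetical stand-ins for a full state list and the Britannica city dictionary, and the regular-expression matching is one plausible way to detect "explicitly written" state names or codes.

```python
import re

# Tiny illustrative lookups; a real run would need all states and a full city list.
STATE_NAMES = {"virginia": "VA", "california": "CA", "missouri": "MO", "illinois": "IL"}
STATE_CODES = set(STATE_NAMES.values())
CITY_STATE_PAIRS = [("ferguson", "MO"), ("chicago", "IL"),
                    ("springfield", "IL"), ("springfield", "MO")]

def unambiguous_cities(pairs):
    """Keep only city names that map to exactly one state."""
    by_city = {}
    for city, state in pairs:
        by_city.setdefault(city, set()).add(state)
    return {c: next(iter(s)) for c, s in by_city.items() if len(s) == 1}

CITY_TO_STATE = unambiguous_cities(CITY_STATE_PAIRS)

def state_name_or_code(text):
    """Look for a full state name first, then an upper-case two-letter code."""
    lowered = text.lower()
    for name, code in STATE_NAMES.items():
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            return code
    for token in re.findall(r"\b[A-Z]{2}\b", text):
        if token in STATE_CODES:
            return token
    return None

def city_lookup(text):
    """Map an unambiguous city name found in the text to its state."""
    lowered = text.lower()
    for city, code in CITY_TO_STATE.items():
        if re.search(r"\b" + re.escape(city) + r"\b", lowered):
            return code
    return None

def infer_state(place, profile_location):
    """Step 1: geotag 'place' string. Step 2: profile (state first, then city)."""
    if place:
        found = state_name_or_code(place)
        if found:
            return found
    if profile_location:
        return state_name_or_code(profile_location) or city_lookup(profile_location)
    return None
```

For example, `infer_state(None, "Springfield")` returns `None` because the city name exists in more than one state and is therefore discarded. Two-letter codes that double as English words (such as IN or OR) would need extra care in practice.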

4.2 Comparison methods and settings
We compare our approach with several state-of-the-art approaches as the baseline meth-
ods. In order to evaluate the forecasting effectiveness of the proposed model, we select
three sets of baseline methods.

The first set includes Logistic Regression (LR) and Support Vector Machine (SVM)
classifiers, since they are widely used machine learning methods in the event
detection/forecasting literature. With these methods, we examine the effect of static,
intra-region and inter-region features by combining all features together. The second set
includes recently developed neural-network-based models, RNNs and LSTMs in
particular, as they have been shown to have superior performance in event



forecasting problems due to their capability of modeling temporal dependencies. The
third set comprises the state-of-the-art spatiotemporal event forecasting approaches
recently proposed by [15], including regularized multi-task feature learning (RMTFL),
constrained multi-task feature learning I (CMTFL-1) and constrained multi-task feature
learning II (CMTFL-2). These methods formulate event forecasting for multiple locations
as a multi-task learning problem. They build event forecasting models for different loca-
tions simultaneously by restricting all locations to select a common set of features. Note
that none of the existing approaches support the hierarchical structure of features coming
from intra- and inter-regions, and we will discuss the importance of such differentiation
more in Sect. 5. The baseline methods are summarized as follows:

The first set:
• Logistic Regression (LR) is a simple LR model. We have three baselines for this model:
  LR[tem] uses only intra-region features, LR[s, tem] concatenates static and
  intra-region features, and LR[s, tem, st] merges all features as the input.
• Support Vector Machine (SVM) is a simple SVM model. SVM[tem] employs only
  intra-region features, while SVM[s, tem] combines static features with intra-region
  features. SVM[s, tem, st] uses all features as input.

The second set:
• LSTM is a basic LSTM network that employs only intra-region features. It does not
  consider static features or spatial relationships among regions.
• S + LSTM is the model where intra-region features are given as inputs to the LSTM
  network. Then, the embeddings of dynamic features are concatenated with the static
  features. This model does not consider the spatial relationships among regions.
• S + LSTM (GL) has the same structure as S + LSTM, yet it is trained with
  Group Lasso regularization. With this model, we aim to monitor the effect of Group
  Lasso regularization on the performance of the S + LSTM model.

The third set:
• RMTFL employs a regularization parameter to control the model sparsity.
• CMTFL-1 introduces a constraint to control the number of features in the model for
  sparsity.
• CMTFL-2 restricts the number of features selected from static and dynamic groups
  separately.

Furthermore, to evaluate the effectiveness of individual components of ActAttn, including
the Group Lasso regularization and hierarchical attention mechanism (spatial and
spatiotemporal attentions), we include several variants of ActAttn for comparison as follows:

• ActAttn (w/o GL) has our proposed structure, yet Group Lasso regularization is not
applied during training.

• ActAttn (w/o stAttn) does not include the spatiotemporal attention layer; instead, $h^{tem}_{d}$
and $v^{sp}$ are concatenated.

• ActAttn (w/o spAttn) does not include the spatial attention layer; instead, a linear
projection layer is used.

Settings. In the experiments, we use ‘day’ as the time unit and ‘state’ as the location unit.
The last five days from each dataset are used as the test sets, and the rest as the training sets.
The training set of the Charlottesville dataset contains 127 protest events (15.6% of all
samples in the training set) and the test set contains 9 events. The training set of the Fer-
guson I dataset contains 63 protest events (9% of all samples in the training set) and the



test set contains 27 events. The training set of the Ferguson II dataset contains 82 protest
events (10.7% of all samples in the training set) and the test set contains 22 events. We
enumerate different settings of window size and lead time. The window size k is chosen
from {1, 2, 3} and the lead time τ from {1, 2, 3}. The hidden unit size for LSTM is 16.
The architecture is trained using the Adam optimizer [41] with a learning rate of 0.001.
For the models incorporating Group Lasso regularization, regularization factors $\lambda_1$ and
$\lambda_2$ are selected from the set $\{10^{-5}, 10^{-4}\}$. During test time, the input weights with absolute
values smaller than $10^{-3}$ are set to 0 as suggested in [17]. Our code and data are available
at https://github.com/picsolab/actattn. For the state-of-the-art MTFL-based models,
the regularization parameter is selected from $\{10^{-4}, 10^{-3}, \ldots, 10^{3}, 10^{4}\}$. The number of features
to be selected in the CMTFL-1 model is chosen from {5, 10, ..., 55}. The numbers of static
and dynamic features to be selected in the CMTFL-2 model are chosen from {4, 5, 6, 7, 8} and
{5, 10, ..., 50}, respectively.
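The chronological split and the grid of window sizes and lead times can be sketched as follows. The helper name `split_last_days` and the integer day indices are illustrative assumptions; the 21-day length matches the Charlottesville duration in Table 1.

```python
from itertools import product

def split_last_days(days, n_test=5):
    """Chronological split: the final n_test days form the test set."""
    ordered = sorted(days)
    return ordered[:-n_test], ordered[-n_test:]

# Day indices for a 21-day collection period (Aug 11 to Aug 31 is 21 days).
days = list(range(21))
train_days, test_days = split_last_days(days)

# Every (window size k, lead time tau) combination named in the settings.
settings = list(product([1, 2, 3], [1, 2, 3]))
```

Splitting by time rather than at random keeps all test days strictly after the training days, which is the natural choice for a forecasting task.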

5 Results
In this section, we present a comprehensive set of results. First, in Sect. 5.1, we show the
forecasting effectiveness of the proposed model in comparison with the baseline and
state-of-the-art forecasting approaches under the aforementioned experiment settings.
In Sect. 5.2, we analyze different kinds of predictive features identified by our model and
interpret their effects in relation to the theoretical factors. In Sect. 5.3, we analyze and in-
terpret different kinds of spatial contributions (intra- vs. inter-region). Finally, in Sect. 5.4,
we explore the potential of using additional content features in the current forecasting
framework.

5.1 Performance comparison
We compare the forecasting performance of ActAttn with the comparison methods. We
organize the results to answer the following three questions:

1. Overall, how well could ActAttn forecast future protest event occurrences,
compared with the baseline methods? (Sect. 5.1.1)

2. As missing information is common in social event prediction problems, how robust
is ActAttn in dealing with missing information, compared with the baseline
methods? Additionally, will ActAttn’s spatiotemporal architecture help deal with the
missing or noisy information? (Sect. 5.1.2)

3. How early in time can ActAttn effectively predict future protest event occurrences?
(Sect. 5.1.3)

5.1.1 Overall performance
As shown in Table 2, the results indicate that ActAttn achieves the highest F-score and
AUC values on the Charlottesville (0.400 and 0.843), Ferguson I (0.462 and 0.822) and
Ferguson II (0.471 and 0.853) datasets. The F-scores for all methods are low due to the
imbalance in class distribution (9%–15% protest events). Further, while the protest
occurrence pattern is different for each dataset (Fig. 1), ActAttn is robust to the varying
data distributions and successfully models the temporal and spatial dimensions under
various conditions.
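To make the effect of class imbalance on the F-score concrete, the sketch below computes it from scratch on made-up labels with 10% positives. The labels and predictions are hypothetical and are not drawn from the paper's data.

```python
def f_score(y_true, y_pred):
    """F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 10 protest samples out of 100 (10% positives).
y_true = [1] * 10 + [0] * 90
# A classifier that finds 6 of the 10 protests but raises 12 false alarms.
y_pred = [1] * 6 + [0] * 4 + [1] * 12 + [0] * 78
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = f_score(y_true, y_pred)
```

In this example the accuracy is 0.84 while the F-score is only about 0.43, because a handful of false alarms weighs heavily against a small positive class. An F-score of 0.000, as reported for some SVM rows in Table 2, arises exactly when a classifier produces no true positives.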

We show the significance of static features by comparing the results of LR[tem] with
LR[s, tem], SVM[tem] with SVM[s, tem], and LSTM with S + LSTM. It can be seen that,




Table 2 Forecasting results

                         Charlottesville     Ferguson I          Ferguson II
Method                   F-score   AUC       F-score   AUC       F-score   AUC

LR[tem]                  0.200     0.696     0.103     0.733     0.343     0.752
LR[s,tem]                0.182     0.789     0.259     0.766     0.327     0.789
LR[s,tem,st]             0.200     0.734     0.230     0.722     0.314     0.773

SVM[tem]                 0.200     0.818     0.000     0.791     0.400     0.816
SVM[s,tem]               0.186     0.809     0.000     0.796     0.408     0.837
SVM[s,tem,st]            0.000     0.782     0.000     0.754     0.313     0.780

LSTM                     0.240     0.752     0.415     0.801     0.417     0.819
S + LSTM                 0.267     0.778     0.423     0.804     0.439     0.838
S + LSTM (GL)            0.308     0.793     0.423     0.805     0.440     0.839

RMTFL                    0.182     0.663     0.250     0.703     0.250     0.829
CMTFL-1                  0.182     0.664     0.350     0.711     0.316     0.805
CMTFL-2                  0.200     0.661     0.333     0.711     0.324     0.815

ActAttn (w/o GL)         0.308     0.830     0.459     0.820     0.464     0.849
ActAttn (w/o stAttn)     0.324     0.797     0.406     0.783     0.409     0.842
ActAttn (w/o spAttn)     0.333     0.836     0.448     0.812     0.448     0.846
ActAttn                  0.400     0.843     0.462     0.822     0.471     0.853

in nearly all cases, combining static features with intra-region features yields better F-
score and AUC values. When we further combine inter-region features, we observe
that LR[s, tem, st] and SVM[s, tem, st] give worse results compared to LR[s, tem] and
SVM[s, tem], respectively. Thus, these models fail to capture the spatiotemporal infor-
mation from the concatenated inter-region features. In our approach, combining inter-
region features with static features and intra-region features increases the performance
in all ActAttn-based methods except ActAttn (w/o stAttn). Moreover, S + LSTM (GL) per-
forms slightly better than S + LSTM and eliminates some of the redundant inputs in all
three models.

To compare the performance of ActAttn with the state-of-the-art spatiotemporal event
forecasting approaches, we performed experiments on all the datasets with RMTFL,
CMTFL-1 and CMTFL-2 proposed by [15] by employing various parameter combinations.
We report the best test performances of these approaches on each dataset. The results
indicate that ActAttn significantly outperforms all three approaches on all datasets in terms
of both F-score and AUC values (see note e).

To examine the effect of Group Lasso regularization and the hierarchical attention
mechanism, we compared the performance of ActAttn to its three variants. Although
ActAttn only slightly outperforms ActAttn (w/o GL), Group Lasso regularization provides
sparsity and selects a compact set of features. The ActAttn model provides 95.0%, 76.6%
and 96.8% sparsity for Charlottesville, Ferguson I and Ferguson II, respectively, where
sparsity is computed as the ratio of zero input weights to the total number of input
connections. Furthermore, we compare ActAttn to ActAttn (w/o stAttn) and ActAttn (w/o spAttn)
to examine the effect of the hierarchical attention mechanism. We observe that ActAttn performs
significantly better than ActAttn (w/o stAttn). This shows the importance of the
spatiotemporal attention layer, which adjusts the local and global feature contributions.
Similarly, ActAttn outperforms ActAttn (w/o spAttn). Removal of the spatial attention layer
from the proposed architecture also results in a loss of interpretation capability about the
most contributing locations. Our results indicate that incorporating the spatiotemporal
attention layer enhances the performance of the model the most.
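The sparsity figure quoted above can be reproduced in spirit with a few lines of NumPy. This is a sketch under stated assumptions: the weight matrix is made up, and the pruning threshold of 0.001 follows the test-time rule given in the Settings paragraph.

```python
import numpy as np

def prune_and_sparsity(weights, threshold=1e-3):
    """Zero out small-magnitude weights; return the pruned copy and the fraction of zeros."""
    pruned = np.where(np.abs(weights) < threshold, 0.0, weights)
    sparsity = float(np.mean(pruned == 0.0))
    return pruned, sparsity

# Hypothetical 4x5 input-weight matrix (features x hidden units), not from the paper.
W = np.array([
    [0.5,   -0.2,    0.0004, 0.0,    0.3],
    [1e-5,  -2e-4,   0.0,    9e-4,  -1e-6],
    [0.0,    0.0,    0.0,    0.0,    0.0],
    [0.002, -0.0009, 0.0,    0.0005, 0.7],
])
W_pruned, sparsity = prune_and_sparsity(W)
```

In this toy matrix 15 of the 20 connections end up at zero, giving a sparsity of 0.75; rows that are entirely zero correspond to input features the model no longer uses.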



5.1.2 Robustness to missing information
A common challenge in predicting/forecasting social events is that data (including but
not limited to social media data) often involve missing information or are only partially
complete. For example, social media user activity may be sparse in a certain region or at
a particular time. As ActAttn was designed to capture the spatiotemporal characteristics
and features, we expect that ActAttn would be more robust to missing data if the model
effectively captures the spatiotemporal structure from the training data. To test this, we
simulate two kinds of missing information scenarios.

(1) Missingness in time and space: A missing value could occur in any feature of any
region at any time. To simulate this, we randomly removed different levels of input data
(20%, 40%, 60% and 80%) from the test sets. We then filled the missing values by ran-
domly assigning values taken from the range of non-missing values of the corresponding
features. In this setting, the comparison methods include those methods that take all fea-
tures (static, temporal and spatial features) as input and have the best overall performance
within each of the method variants. Figure 3 shows the forecasting performances of the
methods for each dataset over different levels of missing data. The results indicate that

Figure 3 Forecasting results against varying levels of missingness (in time and space) from the test sets. The
x-axes indicate the levels of missingness, and the y-axes indicate the performance in terms of (a) AUC and
(b) F-score results



ActAttn performs significantly better (in terms of both AUC and F-score) than all the
other methods on all datasets and for almost all levels of missing data.
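The first missingness scenario can be simulated roughly as follows. Two details are assumptions made for this sketch: entries are dropped independently with the given probability (so the realized fraction is only approximately the target level), and replacement values are drawn uniformly between the minimum and maximum of the remaining observed values of the same feature.

```python
import numpy as np

def corrupt_random_entries(X, missing_rate, rng):
    """Replace a random fraction of entries with values from each feature's observed range.

    X has shape (samples, features). Returns the corrupted copy and the boolean mask.
    """
    mask = rng.random(X.shape) < missing_rate
    X_filled = X.copy()
    for j in range(X.shape[1]):
        observed = X[~mask[:, j], j]
        if observed.size == 0:          # nothing left to estimate a range from
            observed = X[:, j]
        lo, hi = observed.min(), observed.max()
        n_missing = int(mask[:, j].sum())
        X_filled[mask[:, j], j] = rng.uniform(lo, hi, size=n_missing)
    return X_filled, mask

rng = np.random.default_rng(0)
X_test = rng.normal(size=(200, 6))      # hypothetical test features
X_corrupt, mask = corrupt_random_entries(X_test, 0.4, rng)
```

Only the test inputs are corrupted here, mirroring the description that data were removed from the test sets while the trained model is left unchanged.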

(2) Missingness in certain regions: The missing values could occur in a particular region
for an entire (short- or long-term) period of time. To simulate this, we randomly selected
different proportions of regions (states, ranging from 20% to 80%) and removed their in-
puts entirely from the test sets. The removed regions thus do not contribute to forecast-
ing events in any of the target regions. In this setting, we included the methods taking
features from the other states for comparison. Note that although these methods include
features from the other states, they do not differentiate intra- and inter-region contributions.
Therefore, we expect that these comparison methods may suffer when some regional
input is missing. Figure 4 shows the forecasting performance of the methods for
each dataset over different levels of missing region information. The results show that
ActAttn outperforms the other methods in terms of both AUC and F-score on all three
datasets and for all levels of missing region information. We also observed that ActAttn
is more stable in nearly all conditions.
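The second scenario, in which whole regions go missing, can be sketched in the same style. The tensor layout (days, regions, features) and the choice to represent a removed region by all-zero inputs are assumptions of this example; the text only states that the inputs of the selected regions were removed entirely.

```python
import numpy as np

def drop_regions(X, fraction, rng):
    """Zero out all inputs of a random subset of regions.

    X has shape (days, regions, features). Returns the copy and the dropped indices.
    """
    n_regions = X.shape[1]
    n_drop = int(round(fraction * n_regions))
    dropped = rng.choice(n_regions, size=n_drop, replace=False)
    X_out = X.copy()
    X_out[:, dropped, :] = 0.0   # assumption: "removed" is modeled as all-zero input
    return X_out, np.sort(dropped)

rng = np.random.default_rng(1)
X_test = rng.random((5, 50, 4)) + 0.1        # hypothetical: 5 days, 50 regions, 4 features
X_dropped, dropped = drop_regions(X_test, 0.2, rng)
```

Unlike the entry-level corruption, this removes a region for the whole test period, so any method that relies heavily on one neighboring state loses that signal completely.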

In both scenarios, we observe that ActAttn is more robust compared to other meth-
ods. This suggests that the design of ActAttn is particularly useful in dealing with missing
information—the hierarchical attention mechanism learns important regions and sum-

Figure 4 Forecasting results against varying levels of missingness for regions (states) from the test sets. The
x-axes indicate the levels of missingness, and the y-axes indicate the performance in terms of (a) AUC and
(b) F-score results



marizes the spatiotemporal information from intra-region and inter-region features, and
the Group Lasso regularization imposes sparsity and selects an informative set of features.

5.1.3 Performance analysis with varying lead time
To examine how early in time ActAttn effectively forecasts future protest event
occurrences, we tested the forecasting under different lead time conditions. The lead time τ is
the length of time (number of days, in our experiment) between the last time t for which data
are available and the time t + τ of the events being forecast (as defined in Sect. 3.1). We
evaluated our method with different lead time settings, where τ ∈ {1, 2, 3}. Figure 5 shows
the forecasting performances of ActAttn and comparison methods over different lead time
settings. The results indicate that ActAttn has significantly better performance compared to
other methods in terms of AUC and F-score on the three datasets across almost all lead time
settings. This suggests that ActAttn is able to achieve better and more stable performance for
short-term event forecasting, up to τ = 3. Due to the limitations of our data, we do not examine
longer-term event forecasting in this work.
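One way to read the window size k and lead time τ is as a rule for pairing inputs with labels. The sketch below builds such pairs for a single region; the function name `make_samples` and the toy arrays are hypothetical, and the exact indexing used in Sect. 3.1 may differ.

```python
import numpy as np

def make_samples(features, labels, k, tau):
    """Pair each k-day input window ending at day t with the label at day t + tau.

    features: (days, n_features); labels: (days,) binary. Returns (X, y).
    """
    X, y = [], []
    for t in range(k - 1, len(labels) - tau):
        X.append(features[t - k + 1 : t + 1])   # days t-k+1 .. t
        y.append(labels[t + tau])               # event indicator at day t + tau
    return np.array(X), np.array(y)

features = np.arange(20, dtype=float).reshape(10, 2)   # hypothetical: 10 days, 2 features
labels = np.array([0, 0, 1, 0, 0, 0, 1, 1, 0, 0])
X, y = make_samples(features, labels, k=3, tau=2)
```

With 10 days, k = 3 and τ = 2, only 6 samples remain, which illustrates why larger windows and longer lead times quickly reduce the usable data in collection periods of about three weeks.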

We further examine the performance results for ActAttn with different window size k
and lead time τ . As defined in Sect. 3.1, the window size represents the amount of informa-
tion needed for forecasting in terms of the number of consecutive days as input. The AUC

Figure 5 Forecasting results against different lead times. The x-axes indicate lead time τ , and the y-axes
indicate the performance in terms of (a) AUC and (b) F-score results



Table 3 AUC results of ActAttn with respect to different window size k and lead time τ

          Charlottesville            Ferguson I                 Ferguson II
          k = 1   k = 2   k = 3      k = 1   k = 2   k = 3      k = 1   k = 2   k = 3
τ = 1     0.842   0.843   0.823      0.807   0.815   0.822      0.853   0.832   0.800
τ = 2     0.839   0.836   0.823      0.807   0.820   0.820      0.831   0.836   0.832
τ = 3     0.830   0.830   0.819      0.791   0.808   0.821      0.818   0.820   0.811

values for the corresponding results are given in Table 3. Accordingly, the best performances
are achieved when (k = 2, τ = 1), (k = 3, τ = 1) and (k = 1, τ = 1) for the Charlottesville,
Ferguson I and Ferguson II models, respectively. In general, the performance either remains
stable or decreases slightly with an increase in the lead time τ, regardless of window size k,
with a few exceptions (e.g., Ferguson II with k = 3).
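The best settings named above can be checked directly against Table 3. The snippet below transcribes the table into a dictionary keyed by (lead time τ, window size k) and picks the maximum for each dataset; the data layout is just a convenient choice for this check.

```python
# AUC values transcribed from Table 3; keys are (lead time tau, window size k).
auc = {
    "Charlottesville": {(1, 1): 0.842, (1, 2): 0.843, (1, 3): 0.823,
                        (2, 1): 0.839, (2, 2): 0.836, (2, 3): 0.823,
                        (3, 1): 0.830, (3, 2): 0.830, (3, 3): 0.819},
    "Ferguson I":      {(1, 1): 0.807, (1, 2): 0.815, (1, 3): 0.822,
                        (2, 1): 0.807, (2, 2): 0.820, (2, 3): 0.820,
                        (3, 1): 0.791, (3, 2): 0.808, (3, 3): 0.821},
    "Ferguson II":     {(1, 1): 0.853, (1, 2): 0.832, (1, 3): 0.800,
                        (2, 1): 0.831, (2, 2): 0.836, (2, 3): 0.832,
                        (3, 1): 0.818, (3, 2): 0.820, (3, 3): 0.811},
}

# For each dataset, the (tau, k) pair with the highest AUC.
best = {name: max(grid, key=grid.get) for name, grid in auc.items()}
```

The resulting pairs are (τ = 1, k = 2), (τ = 1, k = 3) and (τ = 1, k = 1), matching the settings reported in the text.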

5.2 Interpreting the impact of features
We interpret the significance of features, organized by intra-region, inter-region, and
static. Group Lasso regularization has selected a subset of features with the most discrim-
inative power in the models.
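Feature importance in this section is read off the magnitude of learned input weights. A minimal sketch of that ranking is given below; the weight values are invented, the feature names are taken from the text, and the shape convention (one row of weights per input feature) is an assumption.

```python
import numpy as np

def feature_importance(W_input, feature_names):
    """Rank features by the mean absolute value of their input weights.

    W_input has shape (n_features, n_hidden): one row of outgoing weights per feature.
    Features whose whole row is zero (dropped by group sparsity) are omitted.
    """
    scores = np.abs(W_input).mean(axis=1)
    ranked = sorted(zip(feature_names, scores), key=lambda item: -item[1])
    return [(name, float(s)) for name, s in ranked if s > 0]

names = ["num_tweets", "anger", "negation", "posemo"]     # feature names from the text
W = np.array([[0.8, -0.6],      # hypothetical weights, two hidden units
              [0.1,  0.3],
              [0.0,  0.0],      # a feature group fully zeroed out
              [-0.4, 0.2]])
ranking = feature_importance(W, names)
```

Because the absolute value is taken before averaging, the ranking reflects how strongly a feature is used, not the direction of its effect.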

5.2.1 Intra-region dynamic features
Which dynamic features of a state were most important for predicting future protests in
the same state? Figure 6(a) gives a summary, and we provide our interpretation below.
To better understand the significance of those features in each protest context, we conducted
a manual inspection of the tweet content.

1. Social Embeddedness. Among the three relevant features (number of tweets, number
of replies, and number of tweets with URLs), num_tweets is the most powerful: for
all three protest events, online activism within a state is predictive of future offline
protests in the same state. Num_urlTweet, which indicates the number of Twitter posts
that contain an external link to other sources, is also found to be a useful predictor, except
in the case of Ferguson I. This may be because Michael Brown’s death initially
received little attention from news outlets, so external news or relevant URLs may
be less indicative of online activist engagement.

2. Emotions. Both positive and negative emotions (posemo and negemo from LIWC) are
important in all models. In particular, anger (from LIWC) is predictive in all three cases,
which suggests that anger is a good indicator of protest. Moreover, certain
emotions stand out for each protest scenario. For example, disgust (from SentiSense) is
predictive in Charlottesville; hate (from SentiSense) in Ferguson I; and fear (from
SentiSense) in Ferguson II.

In addition, a Moral-Laden feature, PurityVice (the extent of impurity and corruption),
unexpectedly captures an emotion of intense annoyance in predicting Ferguson I protests.
We uncovered this when analyzing the relevant tweets, in which the online community
extensively expressed its sense of being “sick of” or feeling “disgust” at the fact that another
black life was taken by the police.

3. Grievance. Our results indicate that Moral-Laden features are not able to capture
grievance. However, further analysis of the feature negation (from LIWC), i.e., the
use of words such as no, not, never, suggests that it may serve as an indicator of grievance.
This feature is important for all models, and especially for Ferguson I and II. Negation is
used in online communities to emphasize appraisals of how unbelievable and unrealistic a



Figure 6 Mean absolute values of intra-region and inter-region input (gate) weights. These are the input
weights learned from the neural network model (the LSTM networks in the temporal and spatial
components) and the magnitude of weights (which can take any values) allows for a comparison of the
relative importance of different features. (a) Intra-region input weights. (b) Inter-region input weights

situation is when they learn about the specific happenings (e.g., the shooting of unarmed
Michael Brown, the grand jury’s decision to not indict Officer Wilson, and a public rally
against racism) that strongly conflict with their normal sense of moral principles, which
indicates grievance (referring to the feeling of illegitimate injustice).

4. Identity. Social (from LIWC), which refers to the use of personal pronouns—especially
plural ones such as we, you, they, and people—is predictive for all models. These terms
are extensively used to call upon in-group members (we) to recognize the grievances and
express protesting voices against out-group members (they; e.g., the police, a group con-
sidered by a majority of the online community as an embodiment of racism).

5. Others. We also observed the impact of other features. The features verb (from
LIWC) and present (from LIWC) are important in all cases, which indicates the use of
verbs (especially the present tense of auxiliary verbs, such as is, are, have, and can) to
emphasize the happenings and perceived grievance as serious matters of fact. We also
observed the use of action verbs such as go, take, make, need, and think, which call for
necessary actions.

The features of personal pronouns (from LIWC) are also significant predictors; they
involve references to and discussion of the people at the center of what is being protested
for or against. For example, you is important for Charlottesville; the second-person
pronoun extensively refers to President Trump, as online activists questioned him earnestly
about his position on racism. Likewise, he is important in predicting Ferguson I protests,
where it is used mostly to refer to either Michael Brown or Eric Garner, both of whom were
killed by the police; they refers primarily to the police. In Ferguson II, online activists
focused more on the judicial system, which was seen as unsuccessful in delivering justice.
Thus, personal pronouns are less predictive.

5.2.2 Inter-region dynamic features
We explore the effectiveness of inter-region dynamic features by analyzing the input
weights (only the portions which connect inputs to input gates) of each temporal
component in the spatial component, $M_{sp}$. Figure 6(b) summarizes the importance of
inter-region dynamic features in predicting protests within given states. Large percentages
(96.5%, 77.6%, and 97.9% in the cases of Charlottesville, Ferguson I and Ferguson II,
respectively) of the input weights are discarded as a result of Group Lasso regularization.
We select Virginia (VA) from the Charlottesville model and California (CA) from both the
Ferguson I and Ferguson II models to analyze the inter-region input weights, because these
states are all ‘hub’ states for the corresponding models (explained in Sect. 5.3). The result
suggests that other states’ features are much less predictive, especially for Charlottesville
and Ferguson II. Among them, num_tweet performs exceptionally well, which indicates that
the volume of online community activity in other states can also be a significant predictor.

5.2.3 Static features
Figure 7 shows the importance of static feature weights in the three models. The features
representing US regions indicate how predictive the region class for a given state is—e.g.,
is a state in the South more or less likely to have future protests? The results of the
Charlottesville and Ferguson II models exhibit similar patterns, suggesting that both protest
events took place more widely across the US, while Ferguson I started locally with a majority of
black communities, and its model shows that being a Southern state itself is predictive of
future protests.

5.3 Interpreting the local and global contributions and hubs
ActAttn enables us to explore the proportion of local (intra-region) and global (inter-
region) contributions in forecasting protest events, and allows for discovering the “hubs”
that have a more salient contribution in predicting protest events globally. The intra- and
inter-region contributions can be identified based on the spatiotemporal attention weights

Figure 7 Values of static feature weights. These are the static feature weights learned from the neural
network model. The weights (which can take any values) allow for a comparison of the relative importance of
different features. (a) Charlottesville model. (b) Ferguson I model. (c) Ferguson II model



Figure 8 Exploration of local and global contributions to forecasting. While the orange nodes represent the
states which are correctly predicted by the corresponding models, the gray nodes denote the states either
not correctly predicted or where no events occurred, yet still contribute to forecasting events in the correctly
predicted states with a value above a certain threshold. The edges indicate the contribution to forecasting
from source state to target state. The thicker the edge, the more the contribution

in our model, and the hubs can be identified as the regions (states) whose inter-region con-
tributions to others are significant. In our study, we observe that spatial attention weights
do not differ significantly across different samples. These weights represent an overall,
consistent spatial relationship among regions and across days. Therefore, in the follow-
ing analyses, we present both the results aggregated from all test samples as well as the
representative test samples.

5.3.1 Local vs. global contributions
To examine the differences between the local (intra-region) and global (inter-region) con-
tributions for forecasting events, we create a contribution graph for each model. As shown
in Fig. 8, the orange nodes represent states where the offline events are correctly predicted
by the model. The gray nodes represent the states where either the events are not correctly
predicted or no event occurred, yet still contribute to forecasting events in other states.
For visual clarity, we only show gray nodes having an inter-region contribution greater
than a certain threshold (0.01, 0.05 and 0.01 for Charlottesville, Ferguson I and Ferguson II,
respectively) to any of the orange nodes. An edge arrow indicates the contribution of fore-
casting a target state from a source state and the edge weight (encoded by the thickness)
reflects the contribution magnitude. Also for visual clarity, we only show edges whose
weights are more than a certain threshold, which is 0.05, 0.1 and 0.05 for Charlottesville,
Ferguson I and Ferguson II, respectively. For a target state, the self-loop represents the
intra-region contribution while other incoming edges represent the inter-region contri-
butions to that state. Note that there might be states where events occurred on multiple
days. For such states, we show the average contributions in the graph.
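The construction of the contribution graph (averaging over event days, then hiding weak edges) can be sketched as follows. The function name, the dictionary layout and the numbers are hypothetical; only the averaging step and the use of a display threshold come from the text.

```python
from collections import defaultdict

def contribution_edges(daily_contributions, threshold):
    """Average per-day (source, target) contributions and keep edges above a threshold.

    daily_contributions: list of dicts mapping (source, target) -> contribution for one
    event day. A pair with source == target is the self-loop (intra-region) edge.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for day in daily_contributions:
        for edge, value in day.items():
            totals[edge] += value
            counts[edge] += 1
    averaged = {edge: totals[edge] / counts[edge] for edge in totals}
    return {edge: w for edge, w in averaged.items() if w > threshold}

# Hypothetical contributions to target TX on two event days.
days = [
    {("TX", "TX"): 0.50, ("VA", "TX"): 0.30, ("NY", "TX"): 0.02},
    {("TX", "TX"): 0.40, ("VA", "TX"): 0.20, ("NY", "TX"): 0.04},
]
edges = contribution_edges(days, threshold=0.05)
```

In this toy case the NY edge averages 0.03 and is hidden, while the self-loop and the VA edge remain visible.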

The hierarchical attention mechanism in our ActAttn model enables a systematic way to
interpret the intra- and inter-region contributions. The contribution from a source state to
a target state (inter-region) on a specific event day is calculated by $\alpha_{sp} \cdot \alpha_{source}$, where $\alpha_{sp}$
is the attention weight corresponding to the spatial component and $\alpha_{source}$ is the attention
weight for the source state in the spatial component, $M_{sp}$. Similarly, the intra-region
(local) contribution can be estimated by $\alpha_{tem} + \alpha_{sp} \cdot \alpha_{target}$, where $\alpha_{tem}$ is the attention weight
corresponding to the (intra-) temporal component and $\alpha_{target}$ is the attention weight for
the target state in the spatial component. As shown in Fig. 8(a), VA has a salient contribution
(as a part of global contribution) to forecasting the states where the events are correctly



predicted for the Charlottesville case. In other words, social media activity in VA would be
a powerful signal for forecasting offline events in the other states. Moreover, CA (mostly),
IL and MO can be regarded as hubs, as they contribute more than others to the target
states for forecasting events in Ferguson I (Fig. 8(b)). On the other hand, the inter-region
contributions from CA and NY to target states are much greater than the other states in
Ferguson II (Fig. 8(c)). Note that local (intra-region) contributions (reflected by the self-
loop weights) for any target state are higher than the contributions from any other state
in all three models. This suggests that local activity still plays a more important role than
the activity of any other single state. Interestingly, in the case of Charlottesville, the global
contribution (the total inter-region contributions of all other states) to a target state is greater
than the local one, suggesting that the Charlottesville protests have a very distinct
spatiotemporal process compared with the other two cases.
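The two contribution formulas in this subsection can be turned into a small worked example. The attention values below are invented for illustration and are assumed to sum to 1 at each level (as softmax outputs would); the function name is hypothetical.

```python
def contributions(alpha_tem, alpha_sp, spatial_attention, target):
    """Split one forecast into intra-region and per-source inter-region contributions.

    alpha_tem, alpha_sp: spatiotemporal attention weights of the temporal and spatial parts.
    spatial_attention: {state: weight} over all states inside the spatial component.
    """
    # Local part: temporal component plus the target's own share of the spatial component.
    intra = alpha_tem + alpha_sp * spatial_attention.get(target, 0.0)
    # Global part: each other state's share of the spatial component.
    inter = {s: alpha_sp * w for s, w in spatial_attention.items() if s != target}
    return intra, inter

# Hypothetical attention weights (each level sums to 1); not values reported in the paper.
spatial_attention = {"VA": 0.5, "CA": 0.2, "NY": 0.2, "TX": 0.1}
intra, inter = contributions(alpha_tem=0.4, alpha_sp=0.6,
                             spatial_attention=spatial_attention, target="TX")
```

With these numbers the local contribution (0.46) exceeds that of any single other state (at most 0.30 from VA), yet the summed global contribution (0.54) is larger than the local one. This is the same qualitative pattern the text describes for Charlottesville.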

5.3.2 The effect of hubs
To further illustrate the hub effect, we select the representative test samples obtained from
Texas (TX), Washington (WA) and Illinois (IL), which are events correctly predicted by
the Charlottesville, Ferguson I and Ferguson II models, respectively.

In the Charlottesville model, the spatiotemporal attention weights for local and global
contributions are 0.458 and 0.542, respectively, meaning that the global part contributes
more to forecasting the protest in TX for the given sample. To further analyze the global
contribution and hub effect, we visualize the inter-region input (gate) weights and the
spatial attention weights as shown in Fig. 9. We observe that Group Lasso regularization
selects informative features from only a few states—namely VA, New York (NY), CA and
TX (Fig. 9(1a))—and the spatial attention layer further selects VA, CA and NY as hubs
(Fig. 9(1b)). VA is the hub that contributes most to predicting the protest event for the given
test sample from TX. Since the trigger event of the Charlottesville rally occurred in VA,
the higher attention weight for VA is a potential indicator that the proposed model
successfully captures the spatiotemporal relationships among the regions for the Charlottesville
dataset.
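
The per-state summary of inter-region input weights described above can be sketched as follows. This is a minimal illustration, assuming the input (gate) weights are available as one small matrix per source state; the data layout, function names, tolerance, and weight values are hypothetical.

```python
def mean_abs_weight_per_state(input_weights):
    """input_weights: dict state -> list of rows of input (gate) weights."""
    summary = {}
    for state, rows in input_weights.items():
        values = [abs(w) for row in rows for w in row]
        summary[state] = sum(values) / len(values)
    return summary


def selected_states(summary, tol=1e-6):
    """States whose weight group was not driven to (near) zero by Group Lasso."""
    return sorted(s for s, m in summary.items() if m > tol)


# Hypothetical weights: the groups for FL and OH are zeroed out.
weights = {
    "VA": [[0.8, -0.4], [0.2, 0.6]],
    "CA": [[0.1, -0.3], [0.0, 0.2]],
    "FL": [[0.0, 0.0], [0.0, 0.0]],
    "OH": [[0.0, 0.0], [0.0, 0.0]],
}
summary = mean_abs_weight_per_state(weights)
kept = selected_states(summary)
```

Here only VA and CA survive, which mirrors the idea that Group Lasso keeps informative features from a few states before spatial attention picks hubs among them.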

Figure 9 Exploration of global contribution and hub effect. (a) Mean absolute values of inter-region input
weights across states. (b) Attention weights of spatial attention for predicting protests in TX (1b), WA (2b), and
IL (3b)


In the Ferguson I model, the spatiotemporal attention weights for local and global con-
tributions are 0.591 and 0.409, respectively. This indicates that locality is more predictive
for the given test sample of WA. Spatial attention attends to CA, IL, Missouri (MO)
and TX (Fig. 9(2b)), suggesting the high impact of these states. Ferguson is located in St.
Louis County, MO, where the shooting of Michael Brown happened. It is also very close to the IL
border. The reactions to the Ferguson shooting on social media most likely started spreading
from these states. CA is an active state where both online (tweet volume) and offline
activities occurred much more frequently than in other places.

In the Ferguson II model, the spatiotemporal attention weights for local and global
contributions are 0.576 and 0.424, respectively, for the correctly predicted test sample
from IL. As shown in Fig. 9(3a) and Fig. 9(3b), CA and NY
are selected by the spatial attention as the most attended regions (among those initially
given by the Group Lasso). This suggests that the protest forecasting may be impacted by
the heightened social media discussion in these hub states, in relation to, for example, the
NYPD shooting of Akai Gurley and the arrest of BLM activists in the Bay Area during the
study period.

5.4 Testing predictive power with additional features
While our selection of features is theory-driven, we also consider the possibility of
incorporating additional features that emerge as the events unfold and that could
help increase the predictive power of the model in a meaningful way. Specifically,
we consider whether keywords that Twitter users employ to plan, organize,
or mobilize protests may also serve as effective features. Because mobilization
activities and activism on Twitter are, in most cases, organized and advocated by Twitter users
through hashtags, we focus on identifying the most widely used hashtags. Treating each day
as a document, we rank hashtags by TF-IDF value and take the top-k (k = 100). We
then include these top-100 hashtags as additional features to see whether they affect forecasting, and
analyze the most predictive features.
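
The hashtag ranking can be sketched roughly as below. The text does not state which TF-IDF variant is used or how per-day scores are combined into a single ranking, so this sketch assumes a plain tf * log(N / df) weighting and scores each hashtag by its best day; the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def top_k_hashtags(daily_hashtags, k=100):
    """daily_hashtags: list of lists; each inner list holds the hashtags seen on one day."""
    n_days = len(daily_hashtags)
    # Document frequency: on how many days does each hashtag appear?
    doc_freq = Counter()
    for tags in daily_hashtags:
        doc_freq.update(set(tags))
    best = {}
    for tags in daily_hashtags:
        counts = Counter(tags)
        total = sum(counts.values())
        for tag, c in counts.items():
            score = (c / total) * math.log(n_days / doc_freq[tag])
            # Assumption: keep the highest daily TF-IDF score per hashtag.
            best[tag] = max(best.get(tag, 0.0), score)
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return ranked[:k]


days = [["#a", "#a", "#b"], ["#b", "#c"], ["#b"]]
top_two = top_k_hashtags(days, k=2)
```

In this toy input "#b" appears every day, so its inverse document frequency is zero and it ranks last, which is the usual TF-IDF behavior for ubiquitous terms.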

We assign the ratio of the number of tweets that include the hashtag to the total number of
tweets at the specific time (day) as the feature value for the corresponding hashtag.
According to the results given in Table 4, employing the additional features decreases the
performance in terms of both F-score and AUC for all three datasets. Furthermore, we
explore the importance of these hashtag features by analyzing the input weights. In all
three cases, fewer than 10% of the features have non-zero weights after Group Lasso
regularization, meaning that most of the features do not contribute to forecasting
events as either intra- or inter-region features. The informative hashtags include:
“#theresistance” for Charlottesville; “#ferguson,” “#mikebrown” and “#justiceformikebrown” for
Ferguson I; and “#ferguson,” “#ericgarner,” “#tamirrice” and “#fergusondecision” for
Ferguson II. However, the weights of these features are much smaller than the weights of the
theory-driven features employed in the original model.
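
The hashtag feature value described above (share of a day's tweets that contain the hashtag) could be computed along these lines. This is a sketch under assumptions: each tweet is represented as a set of hashtags, days without tweets get a value of 0.0, and the function name and sample data are hypothetical.

```python
def hashtag_ratio_features(tweets_by_day, hashtags):
    """Return one feature row per day with one ratio per hashtag.

    tweets_by_day: list (one entry per day) of lists of tweets,
        where each tweet is a set of the hashtags it contains.
    """
    features = []
    for tweets in tweets_by_day:
        n = len(tweets)
        row = [
            (sum(1 for t in tweets if tag in t) / n) if n else 0.0
            for tag in hashtags
        ]
        features.append(row)
    return features


sample = [
    [{"#ferguson"}, {"#ferguson", "#mikebrown"}, set(), {"#other"}],
    [],  # a day with no collected tweets (assumed to map to 0.0)
]
rows = hashtag_ratio_features(sample, ["#ferguson", "#mikebrown"])
```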

Table 4 Forecasting results with and without hashtag features. C.F. stands for content features

               Charlottesville      Ferguson I           Ferguson II
               F-score   AUC        F-score   AUC        F-score   AUC
Without C.F.   0.400     0.843      0.462     0.822      0.471     0.853
With C.F.      0.308     0.814      0.453     0.815      0.435     0.825


6 Discussion and future work
In this work, we presented an interpretable, predictive model to forecast offline protest
events from online activities. We developed a novel deep learning architecture which ef-
fectively learns a hierarchical structure of effective features, and at the same time, enables
a theory-relevant interpretation. Through extensive experiments, we demonstrated the
strength of the proposed model; compared with the baseline methods, our model achieved
superior forecasting performance for all movement datasets. It was also more robust with
regard to missing data, and consistently outperformed other methods in various early fore-
casting settings.

Our model not only outperforms existing prediction techniques, but also enables
theory-driven feature selection and differentiates intra-region from inter-region
inputs. This allows us to examine whether the theorized factors are useful in predicting
protests, and how the theoretical framework helps to interpret the model’s
efficacy and its distinct performance across the three chosen protest cases in a
meaningful way. Such an approach could offer insights for further investigations into the
nature and occurrence of protests. Here, we first summarize and explicate whether and
how the theory-driven features contribute to forecasting protests. We then discuss the
limitations of our work and potential future directions.

6.1 Interpretation of the theory-driven features
First, overall, greater volumes of tweeting and networking behaviors (including original
tweets, replies, and associated content with hyperlinks) had strong predictive power.
This result is consistent with prior empirical studies (e.g., [11]): more online discussions
may reflect higher public awareness and concern regarding the focal issues and events
associated with protests, and they open a cyberspace of social embeddedness. Yet, our model
allows more differentiated observation and interpretation across protests, in terms of how
the social embeddedness was shaped: by messages and interactions within the local state
or beyond. For example, we found that the number of replies played a more significant role only
in Charlottesville, suggesting that the way social embeddedness was created may differ
between Charlottesville and Ferguson. Also, the number of URL links
was much more useful in Ferguson II when the tweets came from the local state where the
protests happened than when they came from other states.

Second, negative emotions have been studied and theorized to be associated with
protests [4, 6], and our results are consistent with this—particularly anger. However, other
negative emotions, such as disgust, hate, and fear also stood out, and had distinct predic-
tive power for the Charlottesville counter-protests, Ferguson I and Ferguson II, respec-
tively. Such results, together with our manual inspection of the content of sampled tweets
in order to understand what these emotions suggested, also offer insights for future stud-
ies in social movements to examine the associations between particular emotions and the
nature of protests across contexts.

Third, while the operationalization of one theorized factor, grievance, through the
Moral-Laden dictionary did not turn out as planned, we discovered that the language
pattern of negation could be a potential signal of grievance. The prediction
results showed that negation (from the LIWC dictionary) could be a good predictor feature
for all protest cases, and our manual inspection of the sampled tweets revealed that its
semantic meaning could serve as an indicator of grievance. This could be a potential means
of identifying grievance-related information in future studies.


Finally, identity, operationalized by using the social category from the LIWC dictionary,
was able to capture group identities, and the results showed its predictive power,
especially for Charlottesville and Ferguson I, but not Ferguson II; the second-person pronoun
is more predictive in Charlottesville, and the third-person pronoun in Ferguson I.

In brief, our model goes beyond indicating that online discussion, including emotional
tweets, may help predict offline protests. That point has been studied and widely
recognized. Rather, our study offers insights as to where (intra- or inter-region) and how (the features
were not selected randomly or through unsupervised learning, but were theory-driven) the
features may offer explanatory power.

6.2 Limitations and future work
There are some limitations in our current work. (1) Our results indicated that consid-
ering spatial relationships among the locations increases the performance of forecasting
protest events. However, the proposed architecture models the spatial structure irrespec-
tive of the locations of events. In other words, it does not differentiate the pairwise rela-
tionship between a particular event location and other locations. Future research might
consider modeling the relationships between pairs of locations. (2) In the context of fore-
casting protests or other civil unrest events, data is generally sparse in terms of event
occurrences. Events either increasingly happen within a short period after a trigger event,
or only occur in particular locations. The data sparsity makes it difficult to learn complex
spatiotemporal relationships. Our current model was not specifically designed to tackle
this data sparsity issue. (3) In the currently proposed architecture, the spatial component
Msp, which models the spatial relationships over locations, is a complex component. It
consists of a set of temporal components, one for every location, each with
its own LSTM. As the number of locations increases, the number of parameters
to be learned increases linearly. Although Group Lasso regularization has significantly
reduced the complexity of this component, further reducing the complexity of the model
would be desirable.
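
To make the linear growth in limitation (3) concrete, the sketch below counts the trainable parameters of a standard LSTM layer (four gates, each with input weights, recurrent weights, and a single bias vector) and multiplies by the number of locations. The input and hidden sizes are hypothetical, attention and output-layer parameters are ignored, and some libraries keep two bias vectors per gate, which would change the constant but not the linear trend.

```python
def lstm_param_count(input_dim, hidden_dim):
    """Parameters of one standard LSTM layer with a single bias per gate."""
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)


def spatial_component_param_count(n_locations, input_dim, hidden_dim):
    """One LSTM per location, so the total grows linearly with n_locations."""
    return n_locations * lstm_param_count(input_dim, hidden_dim)


# Hypothetical sizes, for illustration only.
per_lstm = lstm_param_count(input_dim=10, hidden_dim=16)
fifty = spatial_component_param_count(50, 10, 16)
hundred = spatial_component_param_count(100, 10, 16)
```

If all locations shared a single set of LSTM weights, the count in this simplified view would stay at the per-LSTM figure regardless of the number of locations.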

As part of our future work, we plan to address the aforementioned limitations. In particular,
we plan to explore generative models as a way to overcome the data sparsity problem
in event forecasting, and to simplify the model using a weight-sharing mechanism.

Acknowledgements
The authors would like to acknowledge the support from NSF #1634944, #1637067, #1739413, and the University of
Pittsburgh ULS Open Access Author Fee Fund. Any opinions, findings, and conclusions or recommendations expressed in
this material do not necessarily reflect the views of the funding sources.

Abbreviations
BLM, Black Lives Matter; LSTM, Long Short-Term Memory; HMM, Hidden Markov Model; LIWC, Linguistic Inquiry and Word
Count; RNNs, Recurrent Neural Networks; LR, Logistic Regression; SVM, Support Vector Machine; GL, Group Lasso; MTFL,
Multi-Task Feature Learning; RMTFL, Regularized Multi-Task Feature Learning; CMTFL, Constrained Multi-Task Feature
Learning; NN, Neural Network; SGD, Stochastic Gradient Descent; AUC, Area Under Curve; NYPD, New York Police
Department.

Availability of data and materials
Data and code are available at https://github.com/picsolab/actattn.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
YRL, WTC, and AME conceived and designed the study. AME conducted the experiments. MY and AL contributed to the
data collection and processing. AME, YRL and WTC analyzed and interpreted the results and wrote the manuscript. All
authors read and approved the final manuscript.

Author details
1School of Computing and Information, University of Pittsburgh, Pittsburgh, USA. 2Graduate School of Informatics,
Middle East Technical University, Ankara, Turkey. 3Department of Psychology in Education, School of Education,
University of Pittsburgh, Pittsburgh, USA.

Endnotes
a Keywords include: Charlottesville, KKK, Ku Klux Klan, Klansman, Klansmen, Nazi, Nazism, racism, racist, supremacy,
supremacist, supremacists, #Charlottesville, #domesticterrorism, #FireBannon, #WhiteSupremacist,
#WhiteSupremacists.
b https://elephrame.com/.
c While the tweets for Charlottesville and Ferguson were collected separately using different collection methods, the
information about protest events was collected from the same data source, the Elephrame website. As we mainly
focus on the spatiotemporal patterns of the offline protest events, the difference in terms of methods used for
collecting tweets will not significantly impact our results and interpretation.
d https://www.britannica.com/topic/list-of-cities-and-towns-in-the-United-States-2023068.
e The AUC of the best model (>0.82) suggests it is possible to rank-order or filter the states where protest events are
likely to happen with reasonable accuracy.

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Received: 4 July 2018 Accepted: 24 January 2019

References
1. Snow DA, Soule SA, Kriesi H (2008) The Blackwell companion to social movements. Wiley, New York
2. Valenzuela S (2013) Unpacking the use of social media for protest behavior: the roles of information, opinion
expression, and activism. Am Behav Sci 57(7):920–942
3. Theocharis Y, Lowe W, van Deth JW, García-Albacete G (2015) Using Twitter to mobilize protest action: online
mobilization patterns and action repertoires in the occupy wall street, indignados, and aganaktismenoi movements.
Inf Commun Soc 18(2):202–220
4. Van Stekelenburg J, Klandermans B (2013) The social psychology of protest. Curr Sociol 61(5–6):886–905
5. Klandermans B, van Stekelenburg J (2013) The political psychology of protest. Eur Psychol 18(4):224–234
6. Goodwin J, Jasper JM (2006) Emotions and social movements. In: Handbook of the sociology of emotions. Springer,
Berlin, pp 611–635
7. González-Bailón S, Borge-Holthoefer J, Rivero A, Moreno Y (2011) The dynamics of protest recruitment through an
online network. Sci Rep 1:197
8. Conover MD, Ferrara E, Menczer F, Flammini A (2013) The digital evolution of occupy wall street. PLoS ONE 8(5):64679
9. Conover MD, Davis C, Ferrara E, McKelvey K, Menczer F, Flammini A (2013) The geospatial characteristics of a social
movement communication network. PLoS ONE 8(3):55957
10. He J, Hong L, Frias-Martinez V, Torrens P (2015) Uncovering social media reaction pattern to protest events:
a spatiotemporal dynamics perspective of ferguson unrest. In: International conference on social informatics.
Springer, pp 67–81
11. De Choudhury M, Jhaver S, Sugar B, Weber I (2016) Social media participation in an activist movement for racial
equality. In: ICWSM, pp 92–101
12. Qi H, Manrique P, Johnson D, Restrepo E, Johnson NF (2016) Open source data reveals connection between online
and on-street protest activity. EPJ Data Sci 5(1):18
13. Ferguson unrest. https://en.wikipedia.org/wiki/Ferguson_unrest. Accessed: 2018-04-01
14. Unite the Right rally. https://en.wikipedia.org/wiki/Unite_the_Right_rally. Accessed: 2018-04-01
15. Zhao L, Sun Q, Ye J, Chen F, Lu C-T, Ramakrishnan N (2015) Multi-task learning for spatio-temporal event forecasting.
In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM,
New York, pp 1503–1512
16. Zhao L, Wang J, Chen F, Lu C-T, Ramakrishnan N (2017) Spatial event forecasting in social media with geographically
hierarchical regularization. Proc IEEE 105(10):1953–1970
17. Scardapane S, Comminiello D, Hussain A, Uncini A (2017) Group sparse regularization for deep neural networks.
Neurocomputing 241:81–89
18. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
19. Chung WT, Lin YR, Li A, Ertugrul AM, Yan M (2018) March with and without feet: the talking about protests and
beyond. In: International conference on social informatics. Springer, pp 134–150
20. Panagiotou N, Zygouras N, Katakis I, Gunopulos D, Zacheilas N, Boutsis I, Kalogeraki V, Lynch S, O’Brien B (2016)
Intelligent urban data monitoring for smart cities. In: Joint European conference on machine learning and
knowledge discovery in databases. Springer, Berlin, pp 177–192
21. Teng X, Yan M, Ertugrul AM, Lin YR (2018) Deep into hypersphere: robust and unsupervised anomaly discovery in
dynamic networks. In: International joint conference on artificial intelligence.
22. Gerber MS (2014) Predicting crime using Twitter and kernel density estimation. Decis Support Syst 61:115–125
23. Korkmaz G, Cadena J, Kuhlman CJ, Marathe A, Vullikanti A, Ramakrishnan N (2015) Combining heterogeneous data
sources for civil unrest forecasting. In: 2015 IEEE/ACM international conference on advances in social networks
analysis and mining (ASONAM). IEEE, New York, pp 258–265
24. Korolov R, Lu D, Wang J, Zhou G, Bonial C, Voss C, Kaplan L, Wallace W, Han J, Ji H (2016) On predicting social unrest
using social media. In: 2016 IEEE/ACM international conference on advances in social networks analysis and mining
(ASONAM). IEEE, New York, pp 89–95
25. Cadena J, Korkmaz G, Kuhlman CJ, Marathe A, Ramakrishnan N, Vullikanti A (2015) Forecasting social unrest using
activity cascades. PLoS ONE 10(6):0128879
26. Ning Y, Muthiah S, Rangwala H, Ramakrishnan N (2016) Modeling precursors for event forecasting via nested
multi-instance learning. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery
and data mining. ACM, New York, pp 1095–1104
27. Ramakrishnan N, Butler P, Muthiah S, Self N, Khandpur R, Saraf P, Wang W, Cadena J, Vullikanti A, Korkmaz G et al (2014)
‘Beating the news’ with EMBERS: forecasting civil unrest using open source indicators. In: Proceedings of the 20th
ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 1799–1808
28. Zhao L, Chen F, Lu C-T, Ramakrishnan N (2015) Spatiotemporal event forecasting in social media. In: Proceedings of
the 2015 SIAM international conference on data mining. SIAM, Philadelphia, pp 963–971
29. Zhao L, Wang J, Guo X (2018) Distant-supervision of heterogeneous multitask learning for social event forecasting
with multilingual indicators. In: AAAI
30. Chung C, Pennebaker JW (2007) The psychological functions of function words. In: Social communication,
pp 343–359
31. Ma J, Gao W, Mitra P, Kwon S, Jansen BJ, Wong K-F, Cha M (2016) Detecting rumors from microblogs with recurrent
neural networks. In: IJCAI, pp 3818–3824
32. Tuor A, Kaplan S, Hutchinson B, Nichols N, Robinson S (2017) Predicting user roles from computer logs using
recurrent neural networks. In: AAAI, pp 4993–4994
33. Hu W, Singh KK, Xiao F, Han J, Chuah C-N, Lee YJ (2018) Who will share my image? Predicting the content diffusion
path in online social networks. In: Proceedings of the eleventh ACM international conference on web search and
data mining. ACM, New York, pp 252–260
34. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. Preprint.
arXiv:1409.0473
35. Denil M, Bazzani L, Larochelle H, de Freitas N (2012) Learning where to attend with deep architectures for image
tracking. Neural Comput 24(8):2151–2184
36. Zhao L, Hu Q, Wang W (2015) Heterogeneous feature selection with multi-modal deep neural networks and sparse
group lasso. IEEE Trans Multimed 17(11):1936–1948
37. Zhu W, Lan C, Xing J, Zeng W, Li Y, Shen L, Xie X et al (2016) Co-occurrence feature learning for skeleton based action
recognition using regularized deep LSTM networks. In: AAAI, vol 2, p 8
38. de Albornoz JC, Plaza L, Gervás P (2012) Sentisense: an easily scalable concept-based affective lexicon for sentiment
analysis. In: LREC, pp 3562–3567
39. Graham J, Haidt J, Nosek BA (2009) Liberals and conservatives rely on different sets of moral foundations. J Pers Soc
Psychol 96(5):1029
40. Freelon D, McIlwain CD, Clark MD (2016) Beyond the hashtags: #ferguson, #blacklivesmatter, and the online struggle
for offline justice
41. Kingma D, Ba J (2014) Adam: a method for stochastic optimization. Preprint. arXiv:1412.6980
