Automatically Tagging Constructions of Causation and Their Slot-Fillers Jesse Dunietz Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213, USA jdunietz@cs.cmu.edu Lori Levin and Jaime Carbonell Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213, USA {lsl,jgc}@cs.cmu.edu Abstract This paper explores extending shallow seman- tic parsing beyond lexical-unit triggers, using causal relations as a test case. Semantic pars- ing becomes difficult in the face of the wide variety of linguistic realizations that causation can take on. We therefore base our approach on the concept of CONSTRUCTIONS from the linguistic paradigm known as CONSTRUCTION GRAMMAR (CxG). In CxG, a construction is a form/function pairing that can rely on arbi- trary linguistic and semantic features. Rather than codifying all aspects of each construc- tion’s form, as some attempts to employ CxG in NLP have done, we propose methods that offload that problem to machine learning. We describe two supervised approaches for tag- ging causal constructions and their arguments. Both approaches combine automatically in- duced pattern-matching rules with statistical classifiers that learn the subtler parameters of the constructions. Our results show that these approaches are promising: they significantly outperform naı̈ve baselines for both construc- tion recognition and cause and effect head matches. 1 Introduction Historically, shallow semantic parsing has focused on tagging predicates expressed by individual lexical units. While this paradigm has been fruitful, tying meaning to lexical units excludes some essential se- mantic relationships that cannot be captured in such a representation. One domain that highlights the problem is causal relations. Causation can be expressed in a tremen- 1. THIS BILL promotes consolidation and coopera- tion among regulatory agencies. 2. SUCH SWELLING can impede breathing. 3. WE DON’T HAVE MUCH TIME, so let’s move quickly. 4. She’s mad because I HID THE CAR KEYS. 5. He died from A BLOCKED ARTERY. 6. Making money is contingent on FINDING A GOOD-PAYING JOB. 7. THIS DECISION opens the way for much broader application of the law. 8. For market discipline to work, BANKS CANNOT EXPECT TO BE BAILED OUT. 9. Judy’s comments were SO OFFENSIVE that I left. Table 1: Examples of causal language, reflecting the anno- tation scheme described in §2.1 (with connectives in bold, CAUSES in small caps, and effects in italics). dous variety of linguistic forms (Wolff et al., 2005). As exemplified in Table 1, possibilities include verbs (1, 2), prepositions/conjunctions (3, 4, 5), adjectives (6), and much more complex expressions. Some of these trickier cases can be handled as idiomatic multi- word expressions (MWEs; 7). Others (8, 9), however, are more structured than typical MWEs: they depend on particular configurations of syntactic relations and slot-fillers, placing them closer to the grammatical end of the continuum of lexicon and grammar. This diversity presents a problem for most se- mantic parsers, which inherit the restrictions of the representational schemes they are based on. Many semantic annotation schemes limit themselves to the argument structures of particular word classes. For example, the Penn Discourse Treebank (PDTB; 117 Transactions of the Association for Computational Linguistics, vol. 5, pp. 117–133, 2017. Action Editor: Christopher Potts. Submission batch: 9/2016; Revision batch: 11/2016; Published 6/2017. 
c©2017 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. Prasad et al., 2008) includes only conjunctions and adverbials as connectives,1 and PropBank (Palmer et al., 2005) and VerbNet (Schuler, 2005) focus on verb arguments. FrameNet (Baker et al., 1998; Fill- more, 2012) is less restrictive, allowing many parts of speech as triggers. Most importantly, though, all these representations share the fundamental simplifying assumption that the basic linguistic carrier of meaning is the lexical unit. Some (e.g., PDTB and FrameNet) allow MWEs as lexical units, and much work has been done on detecting and interpreting MWEs (see Baldwin and Kim, 2010). But even these schemes overlook es- sential linguistic elements that encode meanings. In example 9, for instance, a lexical unit approach would have to treat so as encoding the causal relationship, when in fact so merely intensifies the adjective; it is the combination of so and the finite clausal comple- ment that indicates causality. A more general approach can be found in the prin- ciples of CONSTRUCTION GRAMMAR (CxG; Fill- more et al., 1988; Goldberg, 1995). CxG posits that the fundamental units of language are CONSTRUC- TIONS – pairings of meanings with arbitrarily com- plex linguistic forms. These forms are often produc- tive, consisting of some fixed elements combined with some open slots for semantic arguments. The form/meaning pairings can be as simple as those in traditional lexical semantics. The verb push, for in- stance, is paired with the meaning force to move, and the verb takes two linguistic arguments (sub- ject and object) corresponding to the two semantic arguments (pusher and pushee). But in CxG, the meaning-bearing forms can be much more complex: so X that Y is a single construction, paired with the meaning X to an extreme that causes Y. The CxG paradigm can anchor semantic interpre- tations to any constellation of surface forms, making it potentially invaluable for computational semantics. Even as it has grown in prominence in linguistics, however, CxG has received relatively little attention in NLP. This is partly because the usual approach to operationalizing CxG is to rebuild the entire NLP pipeline to be “constructions all the way down” – to 1PDTB does include a catch-all AltLex category that captures some additional constructions. However, these phrases are very unpredictable, as they include many words beyond the linguistic triggers. They are also restricted to relations between sentences. explicitly model the interactions and inheritance re- lationships between constructions that produce the final utterance and meaning. Here, we take a different approach. Instead of “constructions all the way down,” we propose a “CON- STRUCTIONS ON TOP” methodology: we use a con- ventional NLP pipeline for POS tagging, parsing, and so on, but add a layer for constructional phenomena that directly carry meaning. Rather than specifying by hand the constraints and properties that charac- terize each construction, we allow machine learning algorithms to learn these characteristics. Causal relations present an ideal testbed for this ap- proach. As noted above, causal relations are realized in extremely diverse ways, demanding an operational- ized concept of constructions. Recognizing causal relations also requires a combination of linguistic analysis and broader world knowledge. Additionally, causal relations are ubiquitous, both in our thinking and in our language (see, e.g., Conrath et al., 2014). 
Recognizing these relations is thus invaluable for many semantics-oriented applications, including tex- tual entailment and question answering (especially for “why” questions). They are especially helpful for domain-specific applications such as finance, politics, and biology (see, e.g., Berant et al., 2014), where ex- tracting cause and effect relationships can help drive decision-making. More general applications like ma- chine translation and summarization, which ought to preserve stated causal relationships, can also benefit. In the remainder of this paper, we suggest two related approaches for tagging causal constructions and their arguments. We first review an annotation scheme for causal language and present a new corpus annotated using that scheme (§2). We then define the task of tagging causal language, casting it as a construction recognition problem (§3). Because it is so hard to identify the relevant components of a construction (tenses, grammatical relations, etc.), the scheme and task do not explicitly include all of these elements. We instead tag the words that participate in a causal construction as a proxy for that construction. We leave it to annotators (when humans are annotat- ing) or machine learning (during automated tagging) to assess when the full constellation of constructional elements is present. Next, we present Causeway-L and Causeway-S, two versions of a pipeline for performing this task, 118 and compare the two approaches. Both approaches use automatically induced patterns, either syntactic (§4.1) or lexical (§4.2), to find possible lexical trig- gers of causal constructions, as well as their likely arguments. They then apply a mix of construction- specific classifiers and construction-independent clas- sifiers to determine when causal constructions are truly present. We report on three sets of experiments (§5) assessing the two systems’ performance, the im- pacts of various design features, and the effects of parsing errors. The results indicate the viability of the approach, and point to further work needed to improve construction recognition (§6). 2 Causal Language Annotation Scheme and Corpus Causation is a slippery notion (see Schaffer, 2014), so the parameters of annotating causal language require careful definition. We follow the annotation scheme of Dunietz et al. (2015), which we now briefly review. 2.1 Causal Language Annotation Scheme The scheme of Dunietz et al. (2015) focuses specifi- cally on causal language – language used to appeal to psychological notions of cause and effect. It is not concerned with what causal relationships hold in the real world; rather, it represents what causal relationships are asserted by the text. For example, cancer causes smoking states a false causation, but it would nonetheless be annotated. On the other hand, the bacon pizza is delicious would not be annotated, even though bacon may in fact cause deliciousness, because the causal relationship is not stated. The scheme defines causal language as any con- struction which presents one event, state, action, or entity as promoting or hindering another, and which includes at least one lexical trigger. For each instance of causal language, up to three spans are annotated: • The causal connective (required) – the lexical items in the construction signaling the causal rela- tionship (e.g., because of ). The connective anno- tation includes all words whose lemmas appear in every instance of the construction. 
This excludes elements that can be absent, such as most copu- las, or classes of interchangeable words, such as determiners, whose lemmas can vary between in- stances. • The cause – generally a full clause or phrase ex- pressing an event or state of affairs (e.g., I attended because Joan was the honoree). When an actor – but no action – is presented as the cause (e.g., I prevented a fire.), the actor is annotated as the cause. • The effect – also generally an event or state of affairs, expressed as a complete clause or phrase (e.g., I attended because of Joan.). The cause and effect spans may be absent, e.g., in a passive or infinitive. Several examples, with connectives, causes, and effects annotated, are shown in Table 1. The scheme permits the connective to be discon- tinuous and arbitrarily complex (e.g., a necessary condition of or if is to ). Annotators together established an informal “constructicon” to guide their decisions about what word patterns should be con- sidered connectives. Note that the connective is not synonymous with the construction itself; rather, it is a lexical indicator of the presence of the construction. As noted above, we take the construction proper to in- clude the relevant grammatical relations, constraints on argument type, etc. To address causal language as independently as possible from other phenomena, and to circumscribe the scope of the annotations, several types of relation- ships are excluded: • Causal relationships with no lexical trigger – e.g., John went home early; he felt ill. • Connectives that incorporate a means or result – this includes lexical causatives such as kill (cause to die) and convince (cause via persuasion). Only connectives denoting pure causation (e.g., cause, prevent, because) are annotated. • Connectives that assert an unspecified causal relationship – e.g., the earthquakes have been linked to fracking. • Temporal language – e.g., after I drank some wa- ter, I felt much better. (Relations like temporal or- der are often repurposed to express causality. We are currently developing an enhanced annotation scheme and corpus that accounts for such cases.) Additionally, for practical reasons, arguments are only annotated when they appear within the same sentence as the connective. 119 The scheme labels four different types of causa- tion: CONSEQUENCE, MOTIVATION, PURPOSE, and INFERENCE. It also distinguishes positive causation (FACILITATE) from negative causation (INHIBIT). Our algorithms do not currently make these distinc- tions, so we do not delve into them here. Of course, this scheme does not supply everything a full natural language understanding pipeline would need regarding causal relationships. The scheme does not unpack the argument spans into a richer semantic representation. It also does not cover predi- cates that imply causation, such as kill or convince. Instead, it follows in the tradition of shallow semantic parsing, imposing a canonical representation for cer- tain predicates that abstracts away from the language used to convey them. The shallow semantic pars- ing paradigm enables many applications that only require relevant text spans. It is also the first step towards a full semantic interpretation that includes interpretations of the arguments. 2.2 The BECAUSE Annotated Corpus Based on the data and annotation scheme from Duni- etz et al. (2015), we developed the Bank of Effects and Causes Stated Explicitly (BECAUSE), which was used for all experiments below. 
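To make the shape of these annotations concrete, here is a minimal sketch of how a single instance under the scheme of §2.1 might be represented. The class and field names are ours for illustration, not the BECAUSE release format, and the type and degree shown in the example are illustrative rather than gold annotations.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CausalInstance:
    connective: List[str]   # required lexical trigger; may be discontinuous (e.g., ["if", "is", "to"])
    cause: Optional[str]    # span text; may be absent (e.g., in passives or infinitives)
    effect: Optional[str]   # span text; may be absent
    causation_type: str     # CONSEQUENCE, MOTIVATION, PURPOSE, or INFERENCE
    degree: str             # FACILITATE or INHIBIT

# Example 4 from Table 1; the type and degree here are illustrative, not gold labels.
example = CausalInstance(
    connective=["because"],
    cause="I hid the car keys",
    effect="She's mad",
    causation_type="MOTIVATION",
    degree="FACILITATE",
)
print(example.connective, "|", example.cause, "->", example.effect)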
BECAUSE consists of three sets of exhaustively annotated documents:

• 59 randomly selected articles from the year 2007 in the Washington section of the New York Times corpus (Sandhaus, 2008)
• 47 documents randomly selected from sections 2–23 of the Penn Treebank2 (Marcus et al., 1994)
• 679 sentences3 transcribed from Congress' Dodd-Frank hearings, taken from the NLP Unshared Task in PoliInformatics 2014 (Smith et al., 2014)

2We excluded WSJ documents that were either earnings reports or corporate leadership/structure announcements, as both tended to be merely short lists of names/numbers.
3The remaining sentences and documents were not annotated due to constraints on available annotation effort.

The corpus contains a total of 4161 sentences, among which are 1099 labeled instances of causal language. 1004 of these, or 91%, include both cause and effect arguments.

Many of these documents are the same ones that were annotated in Dunietz et al. (2015), with minor corrections. About 75% of the data is new, but annotated by the same annotators. The scheme's inter-annotator agreement metrics are reproduced in Table 2.

Partial overlap:        Allowed   Excluded
Connectives (F1)        0.78      0.70
Degrees (κ)             1.0       1.0
Causation types (κ)     0.82      0.80
Argument spans (F1)     0.96      0.86
Argument labels (κ)     0.98      0.97

Table 2: Inter-annotator agreement results for the BECAUSE corpus. The difference between the two columns is that for the left column, we counted two annotation spans as a match if at least a quarter of the larger one overlapped with the smaller; for the right column, we required an exact match. κ scores indicate Cohen's kappa. Each κ score was calculated only for spans that agreed (e.g., degrees were only compared for matching connective spans).

Many of the causal constructions in BECAUSE would be harder to annotate in other schemes, as shown in Table 3. We computed these statistics by looking up each connective in the other schemes' lexica. FrameNet captures many more connectives than the others, but it often represents them in frames that are not linked to causality, making comparison difficult.

Corpus                      Connective types   Connective tokens
PDTB                        8.9%               34.4%
Mirza and Tonelli (2014)    29.3%              47.5%
FrameNet                    69.4%              66.9%

Table 3: Percentages of the causal connectives in BECAUSE that would be partially or fully annotatable under other annotation schemes. Connectives were grouped into types by the sequence of connective lemmas.

3 The Causal Language Tagging Task

We define the task of tagging causal language to be reproducing a subset of the annotations of the BECAUSE annotation scheme. We split this task into two parts:

1. Connective discovery, in which the spans of causal connectives are annotated. A connective span may be any set of tokens from the sentence. This can be thought of as recognizing instantiations of causal constructions.
2. Argument identification (or argument ID), in which cause and effect spans are identified for each causal connective. This can be thought of as identifying the causal construction's slot-fillers.

We assume as input a set of sentences, each with POS tags, lemmas, NER tags, and a syntactic parse in the Universal Dependencies (UD; Nivre et al., 2016) scheme, all obtained from version 3.5.2 of the Stanford parser (Klein and Manning, 2003).4

This task is defined in terms of text spans. Still, to achieve a high score on it, a tagger must respond to the meaning of the construction and arguments in context, just as annotators do.
This may be achieved by analyzing indirect cues that correlate with mean- ing, such as lexical information, dependency labels, and tense/aspect/modality information. Compared to the annotation scheme, the task is limited in two important ways: first, we do not dis- tinguish between types or degrees of causation; and second, we only tag instances where both the cause and the effect are present. (Even for connective dis- covery, we only evaluate on instances where both arguments are present, and our algorithms check for spans or tokens that at least could be arguments.) We leave addressing both limitations to future work. Nonetheless, this task is more difficult than it may appear. Two of the reasons for this are familiar is- sues in NLP. First, there is a surprisingly long tail of causal constructions (as we finished annotating, we 4We use the non-collapsed enhanced dependency represen- tation. We could have selected a parser that produces both syn- tactic and semantic structures, such as the English Resource Grammar (ERG; Copestake and Flickinger, 2000) or another HPSG variant. Though these parsers can produce impressively sophisticated analyses, we elected to use dependency parsers because they proved significantly more robust; there were many sentences in our corpus that we could not parse with ERG. How- ever, incorporating semantic information from such a system when it is available would be an interesting extension for future work. Another possible input would have been semantic role label- ing (SRL) tags. SRL tags could not form the basis of our system the way syntactic relations can, because they only apply to lim- ited classes of words (primarily verbs). Also, by examining syntactic relations and undoing passives (see §), we get most of the information SRL would provide. Still, we may include SRL tags as classification features in the future. would still encounter a new construction every 2–3 documents). Second, recognizing these constructions inherits all the difficulties of word sense disambigua- tion; every one of the connectives in Table 1 except examples 6 and 7 has a non-causal meaning. (4 can be used in a discourse sense – roughly, I’m saying this because. . . .) The third reason arises from the com- plexity of causal constructions. Because we allow arbitrarily complex connectives, the space of possible triggers is all subsets of words in the sentence. Thus, the task demands an approach that can trim this space down to a manageable size. 4 Causeway: Causal Construction Tagging Methods Causeway is a system that performs this causal lan- guage tagging task. We implemented two versions of Causeway: Causeway-S, based on syntactic patterns, and Causeway-L, based on lexical patterns. (These are both simple techniques; see §8 for some more complex possibilities we are considering for future work.) Each technique is implemented as a pipeline with four stages: 1. Pattern-based tentative connective discovery. Both techniques extract lexical or lexico-syntactic patterns for connectives from the training data. These patterns are then matched against each test sentence to find tokens that may be participating in a causal construction. 2. Argument identification, which marks the cause and effect spans. 3. A statistical filter to remove false matches. 4. A constraint-based filter to remove redundant connectives. Smaller connectives like to (which is causal in sentences like I left to get lunch)5 are usu- ally spurious when a larger connective includes the same word, like cause X to Y . 
When a larger and a smaller connective both make it through Stage 3 together, we remove the smaller one.

5Following Schneider et al. (2015), BECAUSE considers the “in order to” usage of the infinitive to to carry lexical meaning beyond just marking an infinitival clause.

Because argument ID is done before filtering, the arguments output by Stage 2 do not quite represent the cause and effect arguments of a causal instance. Rather, they represent what the cause and effect spans would be if the connective is indeed causal. (For the same reason, even connective discovery is not complete until the end of the pipeline.) Of course, we lack gold-standard arguments for false-positive connectives. We therefore train argument ID only on instances whose connectives are correct.

We now describe each of the two versions of this pipeline in turn, focusing on their differing first and second stages. We also elaborate on the design of the classifier. Throughout, we take the head of a span to be the token that is highest in the dependency tree. For tokens that are equally high (e.g., if the parser erroneously splits a phrase into two sister subtrees), we prefer verbs or post-copular nouns over other nouns, and nouns over other parts of speech.

4.1 Causeway-S: Syntax-Based Tagging

The syntax-based approach relies on a simple intuition: each causal construction corresponds, at least in part, to a fragment of a dependency tree, where several nodes' lemmas and POS tags are fixed (see Figure 1). Accordingly, the first stage of Causeway-S induces lexico-syntactic patterns from the training data. At test time, it matches the patterns against new dependency trees to identify possible connectives and the putative heads of their cause and effect arguments. The second stage then expands these heads into complete argument spans.

[Figure 1: dependency tree for "I worry because I care" — worry/VBP –nsubj→ I/PRP; worry/VBP –advcl→ care/VBP; care/VBP –mark→ because/IN; care/VBP –nsubj→ I/PRP.]
Figure 1: A UD parse for the sentence I worry because I care, with the tree fragment corresponding to the because construction bolded. Bolded nodes with unbolded text indicate the slots in the construction. (Care is a dependent of worry in keeping with the UD design philosophy, which maximizes dependencies between content words.)

TRegex Pattern Matching

For pattern matching, we use TRegex (Levy and Andrew, 2006), a grep-inspired utility for matching patterns against syntax trees. During training, the system examines each causal language instance, generating a TRegex pattern that will match tree fragments with the same connective and argument structure. In the example from Figure 1, the generated pattern would match any tree meeting three conditions:6

• some token t1 has a child t2 via a dependency labeled advcl
• t2 has a child t3 via a dependency labeled mark
• t3 has the lemma because and a POS tag of IN

At test time, TRegex matches these extracted patterns against the test sentences. Continuing with the same example, we would recover t1 as the effect head, t2 as the cause head, and {t3} as the connective. TRegex is designed for phrase-structure trees, so we transform each dependency tree into a PTB-like parenthetical notation (see Levin et al., 2014).

Patterns involving verbs vary systematically: the verbs can become passives or verbal modifiers (e.g., the disaster averted last week), which changes the UD dependency relationships. To generalize across these, we crafted a set of scripts for TSurgeon (Levy and Andrew, 2006), a tree-transformation utility built on TRegex.
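Before describing what those TSurgeon scripts do, here is a brief plain-Python illustration of the pattern-matching step itself. The parse encoding and function below are our own sketch and stand in for TRegex; they are not Causeway-S's actual implementation.

from typing import Dict, List, Optional, Tuple

Token = Dict[str, str]       # {"lemma": ..., "pos": ...}
Edge = Tuple[int, int, str]  # (head index, dependent index, relation label)

def match_because_pattern(tokens: List[Token], edges: List[Edge]) -> Optional[Dict[str, int]]:
    """Find t1 -advcl-> t2 -mark-> t3 where t3 has lemma 'because' and POS 'IN'."""
    children: Dict[int, List[Tuple[int, str]]] = {}
    for head, dep, rel in edges:
        children.setdefault(head, []).append((dep, rel))
    for t1 in range(len(tokens)):
        for t2, rel1 in children.get(t1, []):
            if rel1 != "advcl":
                continue
            for t3, rel2 in children.get(t2, []):
                if (rel2 == "mark" and tokens[t3]["lemma"] == "because"
                        and tokens[t3]["pos"] == "IN"):
                    return {"effect_head": t1, "cause_head": t2, "connective": t3}
    return None

# "I worry because I care" (Figure 1): I(0) worry(1) because(2) I(3) care(4)
tokens = [{"lemma": "I", "pos": "PRP"}, {"lemma": "worry", "pos": "VBP"},
          {"lemma": "because", "pos": "IN"}, {"lemma": "I", "pos": "PRP"},
          {"lemma": "care", "pos": "VBP"}]
edges = [(1, 0, "nsubj"), (1, 4, "advcl"), (4, 2, "mark"), (4, 3, "nsubj")]
print(match_because_pattern(tokens, edges))  # {'effect_head': 1, 'cause_head': 4, 'connective': 2}

A real matcher must of course handle arbitrary patterns induced from the training data rather than this single hard-coded one.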
The scripts normalize passive verbs and past participial modifiers into their active forms. Each sentence is transformed before pattern extrac- tion or matching. TRegex Pattern Extraction The algorithm for extracting TRegex patterns first preprocesses all training sentences with TSurgeon. Next, for each causal instance with both a cause and an effect, it uses the Dreyfus-Wagner algorithm (Dreyfus and Wagner, 1971) to find a minimum- weight subtree of the dependency graph that includes the connective, the cause head, and the effect head. The algorithm uses this to build the pattern: for each subtree edge rel(A,B), the pattern requires that the test sentence include nodes related by a dependency labeled as rel. If A or B is a connective word, then the pattern also checks for its lemma and POS in the test sentence nodes. Patterns with more than six non-argument, non-connective nodes are discarded as parse errors or otherwise ungeneralizable flukes. The graph for Dreyfus-Wagner includes a directed 6The actual TRegex pattern encoding these conditions is: /ˆbecause [0-9]+$/=connective 0 <2 /ˆIN.*/ <1 mark >(/.* [0- 9]+/=cause <1 advcl >(/.* [0-9]+/=effect)). 122 edge for each dependency in the sentence’s parse tree. For each edge, a back-edge of equal weight is added, unless the back-edge already exists (which UD allows). This allows Dreyfus-Wagner to find a subtree even when it has to follow an arc in reverse to do so. Most edges have unit weight. However, for nodes with multiple parents (which UD also allows), the algorithm would often choose the wrong dependency path, leading to poor generalization. Accordingly, on paths of the form x xcomp−−−→ y csubj | nsubj−−−−−−−→ z (where xcomp indicates open clausal complements), edge costs are slightly decreased. This helps the algorithm prefer the xcomp path connecting x, y, and z, rather than a path such as x nsubj−−−→ z nsubj←−−− y. Similarly, acl (adjectival clause) and expl (expletive) edges are slightly penalized. Syntax-Based Argument Identification Each syntactic pattern inherently encodes the posi- tions in the tree of the cause and effect heads. Thus, matching these patterns is the first step of argument ID, in addition to (tentative) connective discovery. The second step of argument ID is to expand the ar- gument heads into complete spans. In general, most syntactic dependents of an argument’s head are in- cluded in its span. There are two exceptions: 1. Connective words. Under the UD scheme, words that form part of the connective sometimes appear as dependents of the argument head. For example, in A prevents B from C, from appears as a depen- dent of C, but it is really part of the construction for the verb prevent. Following the annotation scheme, we therefore exclude such connective words (and any of their dependents) from the ar- gument span. 2. Words below the head of the other argument. For example, in A because B, B will be a depen- dent of A (see Figure 1). Obviously, however, it should not be included in the cause span. 4.2 Causeway-L: Lexical Pattern-Based Tagging Syntactic parsers often make mistakes, which be- comes especially relevant for syntactic patterns that examine multiple dependencies. This is particularly problematic for the syntax-based pipeline, for which parse errors, either in training or at test time, can prevent patterns from matching. Additionally, the exact syntactic relations present in a given instance may be altered by the presence of other constructions. 
For example, if a verb appears as a complement of another verb, the path to the subject will have an additional dependency link. We therefore implemented a second algorithm, Causeway-L, that performs connective discovery based on the sequence of word lemmas and parts of speech: instead of extracting and matching parse patterns, it extracts and matches regular expressions. It then uses a conditional random field to label the argument spans, using features from the parse in a more probabilistic way. Connective Discovery with Regular Expression Patterns At training time, we generate regular expressions that will match sequences of connective and argu- ment lemmas. The regexes also make sure that there are tokens in the correct lexical positions for argu- ments. For instance, upon seeing an example like A because B, it would generate a pattern that matches any sequence of lemmas (the effect range), followed by the lemma because with POS tag IN (the connec- tive), followed by any other sequence of lemmas (the cause range).7 Each subpattern is given its own cap- turing group in the regular expression. At test time, each new sentence is turned into a string of lemmas with POS tags for matching. Matching lemmas can be recovered from the capturing groups. Argument Identification with a CRF Unlike syntactic patterns, regular expressions can- not pinpoint the cause and effect arguments; they can only give ranges within which those arguments must appear. Thus, the argument ID stage for this pipeline starts with much less information. We treat the task of argument ID as a sequence labeling problem: given a particular regex-derived connective, the system must identify which tokens are in the cause, which are in the effect, and which are neither. We use a standard linear-chain CRF for this task, implemented with the CRFsuite library.8 7The actual regular expression that encodes this is: (ˆ| )([\S]+ )+?(because/IN) ([\S]+ )+?. 8http://www.chokkan.org/software/ 123 • The lemma of wi • The POS tag of wi • Whether wi ∈ C • The dependency parse path between wi and the token in C that is closest in the parse tree • The absolute lexical distance between wi and the lexically closest token in C • The signed lexical distance between wi and the lexically closest token in C • Whether wi is in the parse tree (false for punctua- tion and several other non-argument token types) • The regex pattern that matched C • The cross-product of the regex pattern feature and the parse path feature • The position of wi relative to C • Whether the lemma of wi is alphanumeric Table 4: Features used for CRF argument ID. For each pos- sible connective C (a set of tokens), features are extracted from each word wi in the sentence. All non-numeric fea- tures are binarized. The features for argument ID are listed in Table 4. 4.3 Voting Classifier for Filtering The pattern-matching stage overgenerates for two reasons. First, due to ambiguity of both words and constructions, not all instances of a given pattern are actually causal. Since, for example, has both causal and temporal senses, which are not distinguished either lexically or syntactically. Second, the patterns do not filter for important constructional elements like tense and modality (e.g., example 8 in Table 1 requires modality of necessity in the cause). Thus, pattern matching alone would yield high recall but low precision. 
To account for this, the final stage of both pipelines is a filter that determines whether each possible con- nective instance, along with its arguments, is in fact being used in a causal construction. This classification task is somewhat, but not en- tirely, heterogeneous. Some aspects are universal – there are regularities in what typically causes what – crfsuite/. We use the python-crfsuite wrapper (https: //github.com/tpeng/python-crfsuite/). 9For this purpose, we represent the tense of a verb as the POS of the verb plus the string of auxiliaries attached to it. The tense of a non-verb is null. For copulas, both the POS of the copula and the POS of its object are included. Connective features: • The label on the dependency from h to its parent • The part of speech of h’s parent • The sequence of connective words* • The sequence of connective lemmas* • Each pattern that matched the connective* Argument features: • The POS tags of c and e • The generalized POS tags of c and e (e.g., N for either NNP or NNS) • The tenses9 of c and e, both alone and conjoined • The label on each child dependency of c and e • For verbs, the set of child dependency labels • The number of words between c and e • The dependency path between c and e • The length of the dependency path from c to e • Each closed-class child lemma of e and of c • The domination relationship between c and e (dominates, dominated by, or independent) • The sets of closed-class child lemmas of e and c • The conjoined NER tags of c and e • Initial prepositions of the cause/effect spans, if any • Each POS 1-skip-2-gram in the cause/effect spans • Each lemma 1-skip-2-gram in the cause/effect spans that was seen at least 4 times in training • Each WordNet hypernym of c and e† Table 5: Features for the causal language candidate filter. c indicates the cause head, e the effect head, and h the connective head. All non-numeric features are binarized. * Used only for per-connective classifiers. † Used only for the global classifier. while others are construction-dependent. To incorpo- rate both kinds of information, a separate soft-voting classifier is created for each unique sequence of con- nective words. Each classifier averages the proba- bility estimates of three less-reliable classifiers: a global logistic regression classifier, which is shared between all connectives; a per-connective logistic regression classifier; and a per-connective classifier that chooses the most frequent label for that connec- tive. Our classifier thus differs slightly from typical voting classifiers: rather than the ensemble consisting of multiple algorithms trained on the same data, our ensemble includes one generalist and two specialists. We use the scikit-learn 0.17.1 (Pedregosa et al., 2011) implementation of logistic regression with L1 regularization and balanced class weights. The logis- tic regression classifiers consider a variety of features 124 derived from the matched pattern, the connective words matching, the argument heads, and the parse tree (see Table 5). An instance is tagged as causal if the soft vote assigns it a probability above 0.45. This cutoff, close to the prior of 0.5, was tuned for F1 on a different random split of the data than was used in the experiments below. In our experiments, the cutoff made little difference to scores. 5 Experiments 5.1 Baselines Our task differs significantly from existing tasks such as frame-semantic parsing, both in the forms of al- lowable triggers and in the semantic relationships tar- geted. 
Our results are therefore not directly compara- ble to a frame-semantic parsing baseline. Instead, we compare our end-to-end results against an argument- aware most-frequent-sense (MFS) baseline. At training time, the baseline first extracts the set of sequences of connective lemmas – i.e., {〈prevent, from〉,〈because, of〉, . . .}. It then builds a table t with one entry for each combination of con- nective and argument parse path, recording how many more times it has been causal than non-causal. For example, consider the sentence The flu pre- vented me from attending. After finding that prevent and from are present in the correct order, the baseline considers every pair (c,e) of non-connective words within a parse radius of two links from the connec- tive. Each (c,e) is considered a possible cause/effect head pair for prevent from. For each pair, it finds the shortest path of dependency links from either prevent or from to c and e (call these paths dc and de). If prevent from is annotated as causal, with cause head c and effect head e, the system increments the count t{prevent from,dc,de}; otherwise, it decrements it. The test algorithm finds the same set of possible (connective, cause head, effect head) tuples in the test sentence. For each tuple, if the corresponding entry in t is greater than 0, the system tags a causal language instance. Argument heads are expanded into spans using the algorithm from Causeway-S. In addition to an end-to-end baseline, we wished to test how helpful our three-way voting classifier is. For each pipeline, then, we also compare that classifier against a most-frequent-sense baseline that chooses the most frequent label (causal or not-causal) for each connective, with connectives differentiated by their lemma sequences. The baseline classifier has no access to any information about the arguments. 5.2 Experiment 1: Pipeline Comparison In this experiment, we measured the performance of Causeway-S, Causeway-L, and the baseline on the tasks of connective discovery and argument ID. We also tried taking the union of each system’s outputs with the baseline’s. Because of the small size of BECAUSE, we report averaged metrics from 20-fold cross-validation, with fold size measured by sentence count. All pipelines were run on the same folds. Evaluation Metrics For connective discovery, we report precision, re- call, and F1 for connectives, requiring connectives to match exactly. In counting true positives and false negatives, the only gold-standard instances counted are those with both a cause and an effect, in keeping with the task definition. For argument ID, we split out metrics by causes and effects. For each, we report: • percent agreement on exact spans • percent agreement on heads • the average Jaccard index (Jaccard, 1912) for gold-standard vs. predicted spans, defined as J(A,B) = |A∩B| |A∪B| , where A and B are the sets of tokens in the two spans. This metric reflects how well the argument spans overlap when they do not match exactly. All argument metrics are reported only for correctly predicted connectives, as there is no way to auto- matically evaluate argument spans for false positives. Note that as a result, argument ID scores are not directly comparable between pipelines – the scores represent how well argument ID works given the previous stage, rather than in an absolute sense. We use the same metrics for Experiments 2 and 3. 
5.3 Experiment 2: Ablation Studies Our second experiment explores the impact of vari- ous design choices by eliminating individual design elements. We report results from using the global and connective-specific classifiers on their own, with- out the soft-voting ensemble. (The MFS classifier is 125 tested alone as part of Experiment 1.) We also report results from the ensemble classifier without using any features that primarily reflect world knowledge: NER tags, WordNet hypernyms, and lemma skip-grams. 5.4 Experiment 3: Effects of Parse Errors In our third experiment, we examined the effects of parse errors on our pipelines’ performance. We com- pared each pipeline’s performance with and without gold-standard parses, using only the Penn Treebank portion of BECAUSE. For gold-standard runs, we used the Stanford dependency converter (De Marn- effe et al., 2006) to convert the gold parses into de- pendency format. We report averaged results from 20-fold cross-validation on this subcorpus. 6 Experimental Results and Analysis 6.1 Experiment 1 Results Results from Experiment 1 are shown in Table 6. Our most important conclusion from these results is that a classifier can indeed learn to recognize many of the subtleties that distinguish causal constructions from their similar non-causal counterparts. Even our end-to-end baseline offers moderate performance, but Causeway-L outperforms it at connective discovery by over 14 F1 points, and Causeway-S outperforms it by 18 points. The design of the filter is a significant contrib- utor here. The MFS classifier alone substantially underperforms the voting classifier, particularly for the syntactic pipeline. The small connectives filter makes up some of the difference, but the full pipeline still beats the MFS filter by 4.4 points for the lexical system and 6.7 points for the syntax-based system. When our pipelines are combined with the end-to- end baseline, the results are better still, beating the baseline alone by 21.4 points. This supports our hy- pothesis that causal construction recognition rests on a combination of both shallow and deep information. As expected, both pipelines show high recall but low precision for the connective discovery stage. (Much of the remaining gap in recall came simply from the long tail of constructions – about half of connective types never saw a pattern match.) The filter does balance out precision and recall for a bet- ter F1. However, as the filter’s steep drop in recall suggests, more work is needed to upweight positive instances. Examining the classifier scores reveals that the filter is doing a good job of assigning low probability to negative instances: the vast majority of false pattern matches are clustered below a probabil- ity of 0.5, whereas the positives are peppered more evenly over the probability spectrum. Unfortunately, the positives’ probabilities are not clustered on the high end, as they should be. Significant leverage could be gained just from im- proving classification for the connective to. For both pipelines, this one connective accounted for 20–25% of end-to-end false positives and false negatives, and nearly half of all misclassifications by the filter. Many of the remaining errors (about 40%) came from just a few simple but highly ambiguous/polysemous con- nectives, including if, for, and so. For complex con- structions (MWEs or more complex syntactic struc- tures), Causeway-L achieved 42% F1 and Causeway- S achieved 48%. 
Overall, then, it seems that the classifier is doing well even at the cases that would challenge typical semantic parsing systems, but it needs some features that will allow it to upweight positive instances of a few challenging words.

For argument ID, both techniques do reasonably well at recovering exact argument spans, and the HC and HE columns show that even when the exact spans do not match, the key content words are mostly correct. The low Jaccard indices, meanwhile, indicate that there is plenty of room for improvement in finding not just heads, but full spans.

Interestingly, effects seem to be harder to recover than causes. The likely culprit is the difference in lengths between the two types of arguments. The distribution of cause lengths is skewed toward low numbers, with a peak at 2 and a median of 5, while effects have a smoother peak at 5 with a median of 7 (Figure 2). The difference makes it harder for the system to guess full effect spans, and even for heads there are more plausible options. The length disparity, in turn, is probably due to the fact that causes are likely to be subjects (19%) or nominal modifiers (13%), which skew short, whereas most effects are primary clauses (24%), complements (30%), or direct objects (12%), which are often more complex.

                                        Connectives          Causes               Effects
Pipeline                                P     R     F1       SC    HC    JC       SE    HE    JE
Causeway-S w/o classifier               7.3   71.9  13.2     65.0  84.3  39.3     30.4  63.0  30.7
Causeway-S w/ MFS                       40.1  37.9  38.6     71.0  87.6  42.0     34.3  64.4  31.9
Causeway-S w/ MFS + SC filter           60.9  36.2  45.1     75.1  92.3  42.9     40.7  75.2  35.8
Causeway-S w/ classifier                51.9  47.6  49.4     68.7  86.9  39.9     38.0  72.5  34.1
Causeway-S w/ classifier + SC filter    57.7  47.4  51.8     67.1  84.4  39.0     37.7  70.7  33.4
Causeway-L w/o classifier               8.1   91.1  14.8     56.8  67.6  33.1     39.5  59.4  30.9
Causeway-L w/ MFS                       61.2  34.8  44.1     73.9  84.7  42.3     51.2  74.3  37.6
Causeway-L w/ MFS + SC filter           61.3  33.9  43.5     74.5  85.0  42.5     50.8  74.0  37.4
Causeway-L w/ classifier                59.0  40.6  47.9     74.5  85.9  42.7     53.3  76.1  38.2
Causeway-L w/ classifier + SC filter    60.4  39.9  47.9     74.3  85.8  42.6     53.3  76.4  38.2
Baseline                                88.4  21.4  33.8     74.1  94.7  43.7     48.4  83.3  38.4
Baseline + Causeway-S                   59.6  51.9  55.2     67.7  85.8  39.5     39.5  73.1  34.2
Baseline + Causeway-L                   62.3  45.2  52.3     73.6  88.9  42.8     53.9  78.6  38.7

Table 6: Results for Experiment 1. SC and SE indicate exact span match for causes and effects, respectively; HC and HE indicate percentage accuracy for cause and effect heads; and JC and JE indicate cause and effect Jaccard indices. "SC filter" indicates the filter for smaller overlapping connectives. For the combinations of the baseline with Causeway, the union of the baseline's and Causeway's outputs was passed to the SC filter.

[Figure 2: histogram of argument span lengths (x-axis: argument length in tokens, 1–20; y-axis: count, 0–150), with separate series for causes and effects.]
Figure 2: Distributions of cause and effect span lengths in the BECAUSE corpus (for instances with both arguments).

6.2 Experiment 2 Results

Results for Experiment 2 are shown in Table 7. The results using single classifiers, rather than an ensemble, uphold our design of combining multiple information sources. Even the best non-ensemble filter, the per-connective filter in Causeway-S, underperforms its ensemble counterpart by 3.1 points.

Likewise, the results from removing world-knowledge features confirm that this knowledge significantly assists the classifier beyond what surface-level features alone can provide. World knowledge features add 2.4 points for Causeway-L and 2.8 points for Causeway-S.
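For concreteness, the per-connective ensemble whose components are ablated above could be assembled along the following lines with scikit-learn. This is our sketch, not Causeway's actual code: feature extraction is omitted, and connectives whose training instances carry only one label would need special handling.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

class ConnectiveFilter:
    """One generalist plus two specialists, soft-voted; built once per connective."""

    def __init__(self, global_clf):
        # global_clf: a LogisticRegression already trained on instances of all connectives
        self.global_clf = global_clf
        self.local_lr = LogisticRegression(penalty="l1", solver="liblinear",
                                           class_weight="balanced")
        self.local_mfs = DummyClassifier(strategy="most_frequent")

    def fit(self, X, y):
        # X, y: feature matrix and 0/1 causal labels for this connective only;
        # assumes both labels occur in training.
        self.local_lr.fit(X, y)
        self.local_mfs.fit(X, y)
        return self

    def predict_causal(self, X, threshold=0.45):
        # Average the three probability estimates and apply the tuned cutoff.
        probs = np.mean([self._p_causal(clf, X) for clf in
                         (self.global_clf, self.local_lr, self.local_mfs)], axis=0)
        return probs >= threshold

    @staticmethod
    def _p_causal(clf, X):
        # Probability of the "causal" class (label 1).
        return clf.predict_proba(X)[:, list(clf.classes_).index(1)]

The per-connective members capture construction-specific regularities, while the shared logistic regression contributes the "universal" cues discussed in §4.3.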
6.3 Experiment 3

Results from Experiment 3 are shown in Table 8. As expected, Causeway-S improved significantly with gold-standard parses, whereas Causeway-L gets only a tiny boost.

Surprisingly, Causeway-S did not improve from better TRegex matching of connectives per se. In fact, all scores for connective matching from the first stage were worse with gold-standard parses. Instead, the improvement appears to come from argument identification: better parses made it easier to identify argument heads, which in turn made the many features based on those heads more reliable. This is supported by the high argument head accuracy with gold parses. Further, when we ran the baseline on the PTB subcorpus with and without gold parses, we saw a similar improvement. Thus, although the limited data and the classifier's failure to upweight positives are still the primary handicaps, better parses would be somewhat helpful for at least the syntax-based approach.

                                     Connectives          Causes               Effects
Pipeline     Classifiers Ablated     P     R     F1       SC    HC    JC       SE    HE    JE
Causeway-S   –                       57.7  47.4  51.8     67.1  84.4  39.0     37.7  70.7  33.4
Causeway-S   Both per-connective     40.8  50.4  44.9     65.6  82.8  38.1     36.8  68.8  32.6
Causeway-S   Global/most-freq.       47.1  51.2  48.7     66.0  82.3  38.3     36.3  69.1  33.0
Causeway-S   Knowledge features      49.5  49.0  49.0     68.9  85.1  39.8     38.3  71.7  33.4
Causeway-L   –                       60.4  39.9  47.9     74.3  85.8  42.6     53.3  76.4  38.2
Causeway-L   Both per-connective     43.6  40.0  41.5     74.5  88.5  42.4     53.9  76.4  38.4
Causeway-L   Global/most-freq.       46.3  38.9  42.1     75.6  86.8  43.1     51.3  74.6  37.6
Causeway-L   Knowledge features      55.5  38.9  45.5     74.8  87.3  42.9     52.1  75.5  37.7

Table 7: Results for Experiment 2. Unablated full-pipeline results from Table 6 are included for comparison.

(a) With automatically parsed data
                                        Connectives          Causes               Effects
Pipeline                                P     R     F1       SC    HC    JC       SE    HE    JE
Causeway-S w/o classifier               14.9  73.3  24.7     63.6  90.9  40.3     18.1  72.7  25.3
Causeway-S w/ classifier + SC filter    54.7  40.2  45.7     78.7  98.4  44.6     46.0  78.4  36.7
Causeway-L w/o classifier               9.3   84.6  16.7     59.4  68.5  33.1     43.2  62.1  31.8
Causeway-L w/ classifier + SC filter    52.4  37.2  43.2     72.9  84.5  40.0     52.3  73.4  35.7

(b) With gold-standard parses
                                        Connectives          Causes               Effects
Pipeline                                P     R     F1       SC    HC    JC       SE    HE    JE
Causeway-S w/o classifier               10.2  70.6  17.7     79.4  98.1  45.7     52.8  90.2  41.3
Causeway-S w/ classifier + SC filter    62.7  51.6  56.0     80.2  96.4  45.6     59.0  92.7  43.4
Causeway-L w/o classifier               9.1   84.1  16.4     57.8  68.2  33.3     53.0  68.0  34.4
Causeway-L w/ classifier + SC filter    56.4  37.9  44.3     77.0  85.3  41.8     67.2  83.4  40.4

Table 8: Results for Experiment 3.

7 Related Work

Our work is of course based on CxG, which has inspired a number of NLP efforts. On the language-resource side, the FrameNet group, noting the many aspects of meaning that are not fully captured in an analysis of lexical triggers, has begun an extensive project to document and annotate grammatical constructions in English (Fillmore et al., 2012). Similar efforts are underway for VerbNet (Bonial et al., 2011) and PropBank (Bonial et al., 2014). On the NLP-tools side, some work has been done on parsing text directly into constructions, particularly through the formalisms of Fluid Construction Grammar (Steels, 2012) and Embodied Construction Grammar (Bergen and Chang, 2005), which take a "constructions all the way down" approach. Some HPSG parsers and formalisms, particularly those based on the English Resource Grammar (Copestake and Flickinger, 2000; Flickinger, 2011) or Sign-Based Construction Grammar (Boas and Sag, 2012), also take constructions into account.
Thus far, however, only a few attempts (e.g., Hwang and Palmer, 2015) have been made to integrate constructions with robust, broad-coverage NLP tools/representations. Other aspects of our work are more closely related to previous NLP research. Our task is similar to frame-semantic parsing (Baker et al., 2007), the task of automatically producing FrameNet annotations. Lexical triggers of a frame correspond roughly to our causal connectives, and both tasks require iden- tifying argument spans for each trigger. The tasks differ in that FrameNet covers a much wider range of semantics, with more frame-specific argument types, but its triggers are limited to lexical units, whereas we permit arbitrary constructions. Our multi-stage approach is also loosely inspired by SEMAFOR and subsequent FrameNet parsers (Das et al., 2014; Roth 128 and Lapata, 2015; Täckström et al., 2015). Several representational schemes have incorpo- rated elements of causal language. PDTB includes reason and result relations; FrameNet frames often include Purpose and Explanation roles; preposition schemes (e.g., Schneider et al., 2015, 2016) include some purpose- and explanation-related senses; and VerbNet and PropBank include verbs of causation. As described in §1, however, none of these covers the full range of linguistic realizations of causality. The ASFALDA French FrameNet project recently proposed a reorganized frame hierarchy for causality, along with more complete coverage of French causal lexical units (Vieu et al., 2016). Some constructions would still be too complex to represent, but under their framework, many of our insights could likely be merged into mainline English FrameNet. Other projects have attempted to address causality more specifically. For example, a small corpus of event pairs conjoined with and has been annotated as causal or not causal (Bethard and Martin, 2008), and a classifier was built for such pairs (Bethard et al., 2008). The CaTeRS annotation scheme (Mostafazadeh et al., 2016), based on TimeML, also includes causal relations, but from a commonsense reasoning standpoint rather than a linguistic one. A broader-coverage linguistic approach was taken by Mirza and Tonelli (2014). They enriched TimeML to include causal links and their lexical triggers, and built an SVM-based system for predicting them. Their work differs from ours in that it requires argu- ments to be TimeML events; it requires connectives to be contiguous spans; and their classifier relies on gold-standard TimeML annotations. More recently, Hidey and McKeown (2016) au- tomatically constructed a large dataset with PDTB- style AltLex annotations for causality. Using this cor- pus, they achieved high accuracy in finding causality indicators. This was a somewhat easier task than ours, given their much larger dataset and that they limited their causal triggers to contiguous phrases. Their dataset and methods for constructing it, how- ever, could likely be adapted to improve our systems. Our pattern-matching techniques are based on ear- lier work on LEXICO-SYNTACTIC PATTERNS. These patterns, similarly represented as fragments of de- pendency parse trees with slots, have proven useful for hypernym discovery (Hearst, 1992; Snow et al., 2005). They have also been used both for the more limited task of detecting causal verbs (Girju, 2003) and for detecting causation relations that are not ex- clusively verbal (Ittoo and Bouma, 2011). Our work extends this earlier research in several ways. 
We propose several methods (CRF-based ar- gument ID and statistical classifiers) for overcoming the ambiguity inherent in such patterns. We also take care to ground our notion of causality in a princi- pled annotation scheme for causal language. This avoids the difficulties of agreeing on what counts as real-world causation (see Grivaz, 2010). 8 Conclusion and Future Work With this work, we have demonstrated the viability of two approaches to tagging causal constructions. We hope that the constructional perspective will prove applicable to other domains, as well. Our code and corpus are available at https://github.com/ duncanka/causeway and https://github. com/duncanka/BECauSE, respectively. In the immediate future, we plan to explore more sophisticated, flexible algorithms for tagging causal constructions that rely less on fixed patterns. Two promising directions for flexible matching are tree kernels and parse forests (Tomita, 1985). We are also pursuing a neural, transition-based tagging model. In parallel, we are working to extend our ap- proaches to cases where causality is expressed using temporal language or other overlapping relations. We are developing a further expanded corpus that will include annotations for such cases, and we expect to extend our algorithms to these new annotations. In the longer run, we plan to demonstrate the use- fulness of our predicted causal language annotations for an application-oriented semantic task such as question-answering. Acknowledgments We thank Jeremy Doornbos, Donna Gates, Nora Ka- zour, Chu-Cheng Lin, Michael Mordowanec, and Spencer Onuffer for all their help with refining the an- notation scheme and doing the annotation work. We are also grateful to Nathan Schneider for his invalu- able suggestions, and to the anonymous reviewers and TACL editors for their useful feedback. 129 References Collin Baker, Michael Ellsworth, and Katrin Erk. 2007. SemEval’07 task 19: frame semantic struc- ture extraction. In Proceedings of the 4th Interna- tional Workshop on Semantic Evaluations, pages 99–104. Association for Computational Linguis- tics, Prague, Czech Republic. Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet Project. In Proceedings of the 17th International Conference on Computational Linguistics, volume 1, pages 86–90. Association for Computational Linguistics, Montreal, Canada. Timothy Baldwin and Su Nam Kim. 2010. Multi- word expressions. In Nitin Indurkhya and Fred J. Damerau, editors, Handbook of Natural Language Processing, volume 2, pages 267–292. CRC Press, Boca Raton, FL. Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Abby Vander Linden, Brittany Harding, Brad Huang, Peter Clark, and Christopher D. Manning. 2014. Modeling biological processes for read- ing comprehension. In Proceedings of the 2014 Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 1499–1510. Association for Computational Linguistics, Doha, Qatar. Benjamin Bergen and Nancy Chang. 2005. Embod- ied construction grammar in simulation-based lan- guage understanding. Construction grammars: Cognitive grounding and theoretical extensions, 3:147–190. Steven Bethard, William J Corvey, Sara Klingenstein, and James H. Martin. 2008. Building a corpus of temporal-causal structure. In Proceedings of the 6th International Conference on Language Re- sources and Evaluation (LREC 2008), pages 908– 915. European Languages Resources Association, Marrakech, Morocco. Steven Bethard and James H. Martin. 2008. 
Learning semantic links from a corpus of parallel temporal and causal relations. In Proceedings of the 46th Annual Meeting of the Association for Computa- tional Linguistics on Human Language Technolo- gies (ACL-08 HLT): Short Papers, pages 177–180. Association for Computational Linguistics, Colum- bus, Ohio. Hans Christian Boas and Ivan A. Sag, editors. 2012. Sign-based construction grammar. CSLI Publica- tions, Stanford, CA. Claire Bonial, Julia Bonn, Kathryn Conger, Jena D. Hwang, and Martha Palmer. 2014. PropBank: Se- mantics of new predicate types. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pages 3013–3019. European Languages Resources Asso- ciation, Reykjavik, Iceland. Claire Bonial, Susan Windisch Brown, Jena D. Hwang, Christopher Parisien, Martha Palmer, and Suzanne Stevenson. 2011. Incorporating coercive constructions into a verb lexicon. In Proceedings of the ACL 2011 Workshop on Relational Models of Semantics, pages 72–80. Association for Com- putational Linguistics, Portland, Oregon. Juliette Conrath, Stergos Afantenos, Nicholas Asher, and Philippe Muller. 2014. Unsupervised extrac- tion of semantic relations using discourse cues. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 2184–2194. Dublin City University and As- sociation for Computational Linguistics, Dublin, Ireland. Ann A Copestake and Dan Flickinger. 2000. An open source grammar development environment and broad-coverage English grammar using HPSG. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC 2000), pages 591–600. European Language Re- sources Association, Athens, Greece. Dipanjan Das, Desai Chen, André F.T. Martins, Nathan Schneider, and Noah A. Smith. 2014. Frame-semantic parsing. Computational Linguis- tics, 40(1):9–56. Marie-Catherine De Marneffe, Bill MacCartney, Christopher D. Manning, et al. 2006. Generat- ing typed dependency parses from phrase structure parses. In Proceedings of the Fifth International Conference on Language Resources and Evalua- tion (LREC 2006), volume 6, pages 449–454. Eu- ropean Languages Resources Association, Genoa, Italy. 130 Stuart Dreyfus and Robert Wagner. 1971. The Steiner problem in graphs. Networks, 1(3):195–207. Jesse Dunietz, Lori Levin, and Jaime Carbonell. 2015. Annotating causal language using corpus lexicog- raphy of constructions. In Proceedings of The 9th Linguistic Annotation Workshop (LAW IX), pages 188–196. Association for Computational Linguis- tics, Denver, CO. Charles J. Fillmore. 2012. Encounters with language. Computational Linguistics, 38(4):701–718. Charles J. Fillmore, Paul Kay, and Mary Catherine O’Connor. 1988. Regularity and idiomaticity in grammatical constructions: The case of let alone. Language, 64(3):501–538. Charles J. Fillmore, Russell Lee-Goldman, and Rus- sell Rhodes. 2012. Sign-based construction gram- mar, chapter The FrameNet constructicon, pages 309–372. In Boas and Sag (2012). Dan Flickinger. 2011. Accuracy vs. robustness in grammar engineering. In Emily M. Bender and Jennifer E. Arnold, editors, Language from a cog- nitive perspective: Grammar, usage, and process- ing, volume 201 of CSLI Lecture Notes, pages 31–50. CSLI Publications. Roxana Girju. 2003. Automatic detection of causal relations for question answering. In Proceedings of the ACL 2003 workshop on Multilingual summa- rization and question answering, volume 12, pages 76–83. 
Roxana Girju. 2003. Automatic detection of causal relations for question answering. In Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering, volume 12, pages 76–83. Association for Computational Linguistics, Sapporo, Japan.
Adele Goldberg. 1995. Constructions: A Construction Grammar Approach to Argument Structure. University of Chicago Press, Chicago, IL.
Cécile Grivaz. 2010. Human judgements on causation in French texts. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), pages 2626–2631. European Language Resources Association, Valletta, Malta.
Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, volume 2, pages 539–545. Association for Computational Linguistics, Nantes, France.
Christopher Hidey and Kathleen McKeown. 2016. Identifying causal relations using parallel Wikipedia articles. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1424–1433. Association for Computational Linguistics, Berlin, Germany.
Jena D. Hwang and Martha Palmer. 2015. Identification of caused motion constructions. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics (*SEM 2015), pages 51–60. Association for Computational Linguistics, Denver, CO.
Ashwin Ittoo and Gosse Bouma. 2011. Extracting explicit and implicit causal relations from sparse, domain-specific texts. In Proceedings of the 16th International Conference on Natural Language Processing and Information Systems (NLDB '11), pages 52–63. Springer-Verlag, Alicante, Spain.
Paul Jaccard. 1912. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50.
Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, volume 1, pages 423–430. Association for Computational Linguistics, Sapporo, Japan.
Lori Levin, Teruko Mitamura, Davida Fromm, Brian MacWhinney, Jaime Carbonell, Weston Feely, Robert Frederking, Anatole Gershman, and Carlos Ramirez. 2014. Resources for the detection of conventionalized metaphors in four languages. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pages 498–501. European Language Resources Association, Reykjavik, Iceland.
Roger Levy and Galen Andrew. 2006. TRegex and TSurgeon: Tools for querying and manipulating tree data structures. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 2231–2234. European Language Resources Association, Genoa, Italy.
Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology (HLT '94), pages 114–119. Association for Computational Linguistics, Plainsboro, NJ.
Paramita Mirza and Sara Tonelli. 2014. An analysis of causality between events and its relation to temporal information. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 2097–2106. Dublin City University and Association for Computational Linguistics, Dublin, Ireland.
Nasrin Mostafazadeh, Alyson Grealish, Nathanael Chambers, James Allen, and Lucy Vanderwende. 2016. CaTeRS: Causal and temporal relation scheme for semantic annotation of event structures. In Proceedings of the 4th Workshop on Events: Definition, Detection, Coreference, and Representation, pages 51–61. Association for Computational Linguistics, San Diego, CA.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association, Portorož, Slovenia.
Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse TreeBank 2.0. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), pages 2961–2968. European Language Resources Association, Marrakech, Morocco.
Michael Roth and Mirella Lapata. 2015. Context-aware frame-semantic role labeling. Transactions of the Association for Computational Linguistics, 3:449–460.
Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium.
Jonathan Schaffer. 2014. The metaphysics of causation. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Summer 2014 edition. http://plato.stanford.edu/archives/sum2014/entries/causation-metaphysics/.
Nathan Schneider, Jena D. Hwang, Vivek Srikumar, Meredith Green, Abhijit Suresh, Kathryn Conger, Tim O'Gorman, and Martha Palmer. 2016. A corpus of preposition supersenses. In Proceedings of the 10th Linguistic Annotation Workshop (LAW X), pages 99–109. Association for Computational Linguistics, Berlin, Germany.
Nathan Schneider, Vivek Srikumar, Jena D. Hwang, and Martha Palmer. 2015. A hierarchy with, of, and for preposition supersenses. In Proceedings of the 9th Linguistic Annotation Workshop (LAW IX), pages 112–123. Association for Computational Linguistics, Denver, CO.
Karin K. Schuler. 2005. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA. AAI3179808.
Noah A. Smith, Claire Cardie, Anne Washington, and John Wilkerson. 2014. Overview of the 2014 NLP unshared task in PoliInformatics. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 5–7. Association for Computational Linguistics, Baltimore, MD.
Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In Advances in Neural Information Processing Systems 17 (NIPS 2004), pages 1297–1304. MIT Press, Vancouver, Canada.
Luc Steels, editor. 2012. Computational Issues in Fluid Construction Grammar. Lecture Notes in Computer Science. Springer-Verlag, Berlin, Germany.
Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Efficient inference and structured learning for semantic role labeling. Transactions of the Association for Computational Linguistics, 3:29–41.
Masaru Tomita. 1985. An efficient context-free parsing algorithm for natural languages. In Proceedings of the 9th International Joint Conference on Artificial Intelligence (IJCAI), volume 2, pages 756–764. Morgan Kaufmann Publishers Inc., Los Angeles, CA.
Laure Vieu, Philippe Muller, Marie Candito, and Marianne Djemaa. 2016. A general framework for the annotation of causality based on FrameNet. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pages 3807–3813. European Language Resources Association, Portorož, Slovenia.
Phillip Wolff, Bianca Klettke, Tatyana Ventura, and Grace Song. 2005. Expressing causation in English and other languages. In Woo-kyoung Ahn, Robert L. Goldstone, Bradley C. Love, Arthur B. Markman, and Phillip Wolff, editors, Categorization Inside and Outside the Laboratory: Essays in Honor of Douglas L. Medin, pages 29–48. American Psychological Association, Washington, DC.