key: cord-0132291-e49ltcqb
authors: Arora, Aryaman; Venkateswaran, Nitin; Schneider, Nathan
title: Hindi-Urdu Adposition and Case Supersenses v1.0
date: 2021-03-02
journal: nan
DOI: nan
sha: 8154df4a583a2e35b7b1a54bed836868449bf543
doc_id: 132291
cord_uid: e49ltcqb

These are the guidelines for the application of SNACS (Semantic Network of Adposition and Case Supersenses; Schneider et al. 2018) to Modern Standard Hindi of Delhi. SNACS is an inventory of 50 supersenses (semantic labels) for labelling the use of adpositions and case markers with respect to both lexical-semantic function and relation to the underlying context. The English guidelines (Schneider et al., 2020) were used as a model for this document. Besides the case system, Hindi has an extremely rich adpositional system built on the oblique genitive, with productive incorporation of loanwords even in present-day Hinglish. This document is aligned with version 2.5 of the English guidelines.

This document is supplementary to the SNACS v2.5 guidelines for English. It focusses on phenomena specific to Hindi-Urdu, while also attempting to give illustrative examples of the whole inventory of supersenses. We hope this will be useful in annotating typologically similar languages of South Asia, as well as a contribution to the literature on case in Hindi-Urdu. Taking a page from the Korean guidelines (Hwang et al., 2021), we also cover a new top-level supersense group CONTEXT.

Hindi and Urdu are two Indo-Aryan-family lects that share a nearly identical grammar, and are best characterised as two diverging registers of one pluricentric language (Kachru, 2009). The combined language is generally called Hindi-Urdu or Hindustani in linguistic literature. While the corpus that was annotated during the creation of these guidelines was written in literary Hindi in the Devanagari script, this document aims to cover both Hindi and Urdu.
To that end, all examples are given in transliteration using a system inspired by the International Alphabet of Sanskrit Transliteration (IAST), similar to the rule-based transliteration algorithm used on the English Wiktionary. Hindi and Urdu diverge lexically even in postposition choice, especially in formal or literary contexts. For example, for the LOCUS postposition meaning 'around', Hindi generally uses kī_cāroṁ_or ('on all four sides'; or 'side' < Sanskrit avarā) while Urdu uses ke_ird-gird (< Persian gird 'round'). An attempt is made to give examples from both registers.

Following Masica (1993), we annotated the Layer II and III function markers in Hindi. These include all of the simple case markers¹ and all of the adpositions.² Our guidelines on the differentially-marked ergative and accusative cases are also applicable to unmarked verbal arguments, but these were not annotated in the first corpus. We also decided to annotate the suffix vālā when used in an adjectival sense (e.g. choṭā-vālā kamrā 'the room that is small'), the comparison terms jaisā and jaise, the extent and similarity particle sā (choṭā-sā kamrā 'small-ish room'), and the emphatic particles bhī, hī, to (Koul, 2008, 137-156). All of these modify the preceding token and mediate a semantic relation between their object and the object's governor, just as conventionally-designated postpositions do.

¹ ne (ergative), ko (dative-accusative), se (instrumental-ablative-comitative), kā/ke/kī (genitive), meṁ (locative-IN), tak (allative), par (locative-ON). Declined forms of the pronouns (including the reflexive apnā) were also included.
² An open class, given the productivity of the oblique genitive ke as a postposition former.

This section covers some of the literature and past work we broadly relied on in constructing these guidelines. The main Hindi grammar we referenced was Koul (2008). There has been a great deal of work on SNACS across many languages.
Those that were generally relevant to this whole document include Schneider et al. (2018). For annotating verbal arguments, we started with Shalev et al. (2019), which established a baseline for dealing with subjects and objects. Archna Bhatia did some initial work on annotating The Little Prince in Hindi in a much earlier SNACS standard. Comparisons with Korean (Hwang et al., 2021), German (Prange and Schneider, 2021), and Gujarati were especially useful in formulating these guidelines. Discussions with the CARMLS research group (particularly Jena Hwang and Vivek Srikumar) and reviewer comments on our work at SIGTYP and SCiL (Arora and Schneider, 2020; Arora et al., 2021) were also instrumental for this work.

Spatial expressions and motion. Making sense of the locative cases and their roles as verbal arguments relied largely on Khan (2009) (to disentangle the various functions of locatives) and Narasimhan (2003) (to understand the framing of motion events).

Verbal arguments. Many of the guidelines on annotating PARTICIPANT-type roles deal with verbal argument structure. There is a great deal of work on this issue in both linguistics and computational linguistics for Hindi. In theoretical linguistics, there are Mohanan (1994) and Butt (1993). Work on case in Hindi includes general work on differential argument-marking (de Hoop and Narasimhan, 2005), dative subjects (Butt et al., 2006; Mohanan and Verma, 1990), and typology (Khan, 2009). The Hindi-Urdu Treebank Project has dominated work on verbal argument structure in computational linguistics for Hindi. It utilises two models of Hindi syntax: a dependency grammar inspired by the traditional kāraka system (Vaidya et al., 2011), and a modern phrase-structure grammar (Palmer et al., 2009; Bhatt et al., 2013). Bhatt says that the two annotations are analogous to Lexical-Functional Grammar (LFG)'s f-structure and c-structure (when traces are removed from the PSG parse).
Other projects in this field are the Hindi-Urdu PropBank (Bhatia et al., 2013a; Vaidya et al., 2013), the separate Urdu PropBank (Anwar et al., 2016; Bhat et al., 2014), and Urdu/Hindi VerbNet (Hautli-Janisz et al., 2015).

Force dynamics. Some of the biggest issues in porting SNACS to Hindi have been in the realm of force dynamics. Constructions with modal auxiliaries, causatives (Begum and Sharma, 2010), and forced actors are still open issues in the guidelines. These are common constructions in South Asian languages, so a resolution to these issues will be necessary as annotation work moves ahead on other languages (e.g. Gujarati).

All examples are written in transliterated form using the International Alphabet of Sanskrit Transliteration (IAST), approximating the spoken pronunciation (i.e. schwa deletion is accounted for). We provide glosses and translations only for illustrative examples in an effort to keep the document concise. The structure of CIRCUMSTANCE and CONFIGURATION is the same as in the English guidelines. For PARTICIPANT, each subsection is a case marker or postposition (instead of a supersense) given the varied functions and scene roles taken on by each marker. For reference, below is a supersense index for PARTICIPANT. Note that the genitive marker kā (§3.5) can nominalise many of these relations.

CIRCUMSTANCE is used directly as a scene role when some additional information is added to contextualize the main event. These tend to involve locative postpositions: meṁ, par, etc. It is used for setting events, often construed as a LOCUS and perhaps serving as an answer to a location-based question, but the postposition itself does not give an explicit location. It is also used for occasions, when the event is only the background for the action (rather than a cause).

Relative time markers such as ke_bād "after" and se_pahle "before" are also included. However, if the difference in time is explicitly stated, then the construal TIME;INTERVAL is used.
Finally, adpositions that pick out an arbitrary point in time from a duration, such as ke_daurān "during" and kī_avdhi_meṁ "in the interval of", also take this as scene role and function.

Discussion. This is the only context in which ko would create an adverb. It doesn't fit under any other function very well. TIME;GOAL was considered at some point but the grammatical functions are entirely different. It was elected to not mix time and location in construals, following precedent.

STARTTIME
The prototypical postposition is se.
'The war has been raging since years ago.'

ENDTIME
The prototypical postposition is tak. STARTTIME is an exact counterpart of this, and the ENDTIME;INTERVAL construal applies for a durative use.

(13) kal       se  kal      tak
     yesterday ABL tomorrow ALL
     'from yesterday until tomorrow'

Discussion. For the durative uses of se and tak it was difficult to come to a consensus on the label; the alternative option (e.g. for se) was DURATION;STARTTIME. We felt that the difference between durative and non-durative was morphosyntactic rather than semantic.

The prototypical examples for FREQUENCY are expressed through reduplication (e.g. kabhī-kabhī 'sometimes') rather than a postposition. For iterations marked ordinally with ke_lie, FREQUENCY is used.

The river flows till the ocean. Is this a statement of fact about where the river ends (thus LOCUS;GOAL), or is it the present flowing of the river to that endpoint (thus GOAL)? We fall back on the most literal reading (so GOAL) in case of ambiguity. This is part of an open issue cross-lingually; see #120en.

The prototypical postposition for this is se, which often takes on the SOURCE function even in other roles. In this function it is comparable to English from.

All of the locative postpositions and case markers can take on a GOAL scene role if licensed by a motion verb. Hindi syntactically patterns with verb-framed languages, but path is usually lexicalized in postpositions (Narasimhan, 2003).
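The construal labels used in these guidelines, such as LOCUS;GOAL or ENDTIME;INTERVAL, pair a scene role with a function, and a single label such as GOAL means that the two coincide. As a minimal sketch of how an annotation tool might read such labels, consider the Python below. The `Construal` type and the `parse_label` helper are hypothetical names chosen for illustration; they are not part of any existing SNACS tooling.

```python
from typing import NamedTuple


class Construal(NamedTuple):
    """A scene role paired with a function (hypothetical helper type)."""
    scene_role: str
    function: str


def parse_label(label: str) -> Construal:
    """Split a label written as 'SCENEROLE;FUNCTION'.

    A bare label such as 'GOAL' is read as scene role and function
    being identical.
    """
    parts = [p.strip() for p in label.split(";")]
    if len(parts) == 1 and parts[0]:
        return Construal(parts[0], parts[0])
    if len(parts) == 2 and all(parts):
        return Construal(parts[0], parts[1])
    raise ValueError(f"not a valid supersense label: {label!r}")


if __name__ == "__main__":
    print(parse_label("LOCUS;GOAL"))  # scene role LOCUS, function GOAL
    print(parse_label("GOAL"))        # scene role and function both GOAL
```

Keeping the two halves separate makes it straightforward to group annotations either by scene role or by function.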
Various verbs that indicate connection and take an argument in the comitative, when dealing with dynamic events, are labelled GOAL;ANCILLARY.

This is traditionally called the perlative case, which is expressed with the ubiquitous se. Unlike English, there is not much variety in PATH adpositions (over, across, through, as well as uses of static location markers), but postposition stacking is permissible with se.

b. vo śaitān mere_pīche LOCUS se bhāg gayā!

se_hokar also marks a PATH (Narasimhan, 2003, p. 150).

Discussion. There is a PATH;INSTRUMENT construal in English for e.g. "escape by tunnel", but there does not seem to be anything instrumental about the equivalent Hindi construction, so we just treat it as a PATH.

DIRECTION is the static or dynamic orientation of something. The prototypical markers for this are kī_taraf, kī_or, and kī_diśā, all grammaticalised from the literal meaning 'in the direction of'.

Distance. Static distance uses the construal LOCUS;DIRECTION, since it refers to a fixed point in space but in a way that frames the distance as movement away from another point. se_dūr and ke_dūr are used in this way.

When referring to scalar values or changes on a scale, tak has the role of EXTENT. kā can function similarly, but takes the construal EXTENT;IDENTITY since it equates two things.

'They retaliated with shootings.'
(47) zyādā tez bhāgne se ṭāṁg toṛ dī.

The how of a situation, usually an adverbial phrase.

As Hindi is a split-ergative language, showing both nominative-accusative and ergative-absolutive alignment, there are two primary ways to mark a canonical subject: the ergative marker ne (when the verb is in perfective aspect) or the unmarked nominative (in all other instances).

CAUSER is an inanimate instigator or force.
Only the ergative case marker ne really takes this supersense, since the kinds of entities that act as CAUSERs are generally not subject to obligation, necessity, or any other modal framings that cause differential subject marking in Hindi.

AGENT is the animate (or construed as such) performer of an action. The AGENT argument to a verb can be expressed with a variety of case markers depending on how the scene is to be framed.

Verbs involving the production or creation of something (banānā 'to make'), communication (batānā 'to tell', kahnā 'to say'), and the giving of a possession (denā 'to give') take the role ORIGINATOR;AGENT for their ergative argument.

Verbs that involve a volitional experience (dekhnā 'to see', mahsūs karnā 'to feel') take the ergative. Note that these often have dative equivalents that take EXPERIENCER;RECIPIENT as their proto-Agents, e.g. dikhāī denā 'to see'.

Verbs in which the ergative subject ends up with possession of an item (lenā 'to take', xarīdnā 'to buy') take this role.

Discussion. In the differentially-marked subjects for obligation, necessity, and ability, the AGENTs do not have volition, so the scene role for them is uncertain. This is part of the broader problem of SNACS's treatment of force dynamics cross-lingually, and will not be easily resolved with the current hierarchy.

Like in most Indo-Aryan languages, ko is a dative-accusative marker. Both senses seem to constitute a single entry in the lexicon; the difference between a dative ko and an accusative ko is not readily apparent to a native speaker without linguistic training.

Syntactic tests for ascertaining function. The dative ko is obligatory while the accusative ko marks animacy, definiteness, and/or salience. Thus, one can use an indefinite (e.g. a plural) and/or inanimate substitution to test if the ko can be dropped; if it can be, then it is an accusative. See also Bhatt et al. (2013, 72-76).

The various accusative markers are all annotated THEME.
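The substitution test for ko can be written as a small decision procedure. The Python sketch below assumes the annotator has already judged whether ko can be dropped once the object is replaced by an indefinite and/or inanimate noun phrase; that judgment is not computed here. The function name and return format are hypothetical, and only the function label is decided, since the scene role still depends on the predicate.

```python
from typing import Tuple


def classify_ko(droppable_after_substitution: bool) -> Tuple[str, str]:
    """Return (case, function label) for one use of ko.

    droppable_after_substitution: the annotator's judgment that ko can
    be omitted when the object is replaced by an indefinite and/or
    inanimate noun phrase (assumed to be supplied by a human).
    """
    if droppable_after_substitution:
        # Optional marking means differential object marking: accusative.
        return ("accusative", "THEME")
    # Obligatory ko is the dative, whose function is RECIPIENT.
    return ("dative", "RECIPIENT")


if __name__ == "__main__":
    print(classify_ko(True))
    print(classify_ko(False))
```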
A THEME undergoes an action, nonagentive motion, a change of state, or transfer. It is a broad category, best signified by the differentially marked (generally on animate or specific objects) accusative ko. Some compound verbs favour kā or par as their object markers. The pronouns have special accusative forms suffixed with -e(ṁ) (mujhe, tujhe, hameṁ, use, etc.), which are all treated the same as ko. Other examples of ko marking verbal arguments are below.

STIMULUS;THEME marks the source of a volitional experience, as with dekhnā 'to see' and sunnā 'to hear'.

Some verbs (samajhnā 'to understand', mānnā 'to accept', etc.) license a TOPIC;THEME for their objects (#3). This includes the adjective-verb compound use of samajhnā.

'The woman made the child sleep.'
THEME is perhaps not the best label for this, but since there is no special handling of force dynamics, this is the best option in the current hierarchy.

The function for this case is RECIPIENT. The canonical example of the dative is an indirect object to which the direct object (THEME) is transferred by the subject (ORIGINATOR); see (82a).

The postpositions ke_zariye 'via, through' and ke_mādhyam_se 'by means of' (in Sanskritised Hindi) also mark INSTRUMENTs.
(103) Gūgal ke_zariye khoj lo.
(104) Hindī bhāṣā ke_mādhyam_se ham logoṁ tak pahuṁc sakte haiṁ.

Causatives in Hindi (e.g. khulvānā 'to make X open Y') can take an animate instrument which exhibits AGENT-like properties (Ramchand, 2011). Currently we annotate these as their predicate-licensed scene role construed as INSTRUMENT.

Discussion. One possible change to this is to create a new function for animate instruments: AIDER.
Animate instruments can control adverbial phrases while inanimate instruments cannot, animate instruments can control instruments of their own, and a similar distinction already exists between inanimate CAUSER and animate AGENT in the hierarchy (Bhatia, 2016; Begum and Sharma, 2010); thus it seems strange to say these are still morphosyntactic INSTRUMENTs. An alternative is to treat this as an AGENT and make a new supersense for the initiator of the action (which is a volitional entity but not an actor itself). This approach is taken by Bill Croft (personal communication).

The ablative sense of se (ABL) takes the function SOURCE. (For the literal meaning of motion away, see that section.) Some of the literal ablative uses to mark verbal arguments get the scene role THEME; refer to the English guidelines. Here are some of the more grammaticalised uses of ablative se to mark verbal arguments, classified as such based on typological considerations given in Khan (2009).

A verb in a passive construction (with the light verb jānā "to go") marks the AGENT (with appropriate predicate-licensed scene role) with the instrumental case marker se. This can also be a debilitative construction when negated. The postposition dvārā also marks a passive subject in some dialects and literary Hindi. Given these facts, we elected to make AGENT a valid function for se.

The main use of kā to mark a PARTICIPANT is in nominalisations of verb phrases, in which it marks arguments to the verb.
(142) 14 sitambar kā din 'hindi-divas' ke_rūp_meṁ manāyā jātā hai.

SPECIES is rare in Hindi. The main instance of this is when the governor of the kā-marked NP is a word like misāl or udāhraṇ 'example'.

(143) Bhārȃtīy kalā kā  udāhraṇ
      Indian   art  GEN example
      'an example of Indian art'

Confusion with CHARACTERISTIC. Semantically, the usual translation equivalent of English type of X into Hindi is tarah kā X. Note, however, that the head of this NP is opposite in Hindi: it is X rather than type.
That construction with kā is labelled CHARACTERISTIC.

GESTALT is the prototypical function of kā, and the genitive forms of pronouns (e.g. merā '1SG.GEN'). Note that the genitives are declined for the gender of their governor. For GESTALT, possession is typically complex or abstract, and usually not alienable (otherwise POSSESSOR is used). As a function, it is also used for nominalisations of verb phrases.

Possessive ke_pās. Like in many Indo-Aryan languages, the postposition for 'near' (ke_pās) has come to have a possessive sense. This is labelled GESTALT;LOCUS (or with a subtype scene role). It was elected not to give the function GESTALT to this since it often implies physical on-person possession when contrasted with the genitive kā.
'He has no time on account of (his) studies.'

Locative subject alternation. The locative case marker meṁ, when applied to a subject of a verb, can indicate a GESTALT;LOCUS, the possessor of a property (Kachru, 1970).

The POSSESSOR label is again associated with genitive kā. This is only for alienable possessions of property (generally physical items, but also less tangible property like data or Bitcoins). Like in English, this includes possessions implying but not explicitly stating previous transfer events.

WHOLE largely follows the English guidelines in its definitions for Hindi, associated chiefly with the genitive kā and the locative meṁ in constructions with the copula (Kachru, 1970). The possessed entity is well-defined on its own, yet not alienable in the sense of being unable to exist by its own self.

CHARACTERISTIC is expressed through kā and vālā. The difference between the two is that vālā tends to emphasise that its object is only one property (of many) of the governor. While vālā is not a standard postposition, it mediates between nouns and noun phrases, assigning one as a CHARACTERISTIC of the other.

As a noun modifier, binā is often coordinated with the postpositions kā or vālā.
In these cases, we do not label binā, but we label the coordinating postposition PARTPORTION. The reasoning is that when binā is dropped, the coordinating postpositions still provide the same semantics (e.g. cīnī vālī cāy 'tea with sugar').

If, however, these can be better interpreted as one whole NP (with the postposition-marked term being a UD nmod to the head), then plain ENSEMBLE applies.

COMPARISONREF is typically marked by se (ABL) 'than', jaisā / ke_jaisā ('like', comparing NPs), and jaise / ke_jaise ('like', adverbial). The latter two are also equivalent to kī_tarah and kī_bhāṁti.

Sufficiency/excess. ke_liye handles sufficiency/excess comparisons, and is labelled COMPARISONREF;PURPOSE in such a usage (Fortuin, 2013, 60).
(226) skūl jāne ke_liye vah kāfī baṛā hai (COMPARISONREF;PURPOSE)

Adverbial. The adverbial jaise / ke_jaise can be read as either indicating an analogy (MANNER;COMPARISONREF) or a conclusion (THEME;COMPARISONREF). The latter reading is especially likely for experiencer verbs (e.g. lagnā 'to seem'), in which case one can try paraphrasing with a complementiser: lagtā hai ki.... If the paraphrase works, then the conclusion reading is more salient.

Implicit comparison. Implicit comparison (instead of a direct comparison of an attribute) is also labelled COMPARISONREF (Bhatia et al., 2013b).

The traditional emphatic particles (hī 'only', bhī 'also', to contrastive, and some uses of tak 'even') are all labelled FOCUS. They are postposition-like, in that they place emphasis on the preceding element in relation to its governor.
(236) maiṁ hī ghar jāūṅgā.
(237) tū to ghar nahīṁ jāegā.
(238) Rāhul, nām to sunā hī hogā.

6 Special labels
When used quotatively, ko and ke_liye are labelled d. These are equivalent to the English infinitival to, hence we agree with the labelling of Schneider et al.
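The paraphrase test for adverbial jaise / ke_jaise given under COMPARISONREF can be summarised as follows. This Python sketch only records the decision: the annotator supplies the judgment of whether the lagtā hai ki paraphrase works, and the function name is hypothetical.

```python
def label_adverbial_jaise(ki_paraphrase_works: bool) -> str:
    """Choose a construal for adverbial jaise / ke_jaise.

    ki_paraphrase_works: the annotator's judgment that the clause can be
    paraphrased with a complementiser (lagtā hai ki ...).
    """
    if ki_paraphrase_works:
        # The conclusion reading is more salient.
        return "THEME;COMPARISONREF"
    # Otherwise the marker is read as indicating an analogy.
    return "MANNER;COMPARISONREF"


if __name__ == "__main__":
    print(label_adverbial_jaise(True))
    print(label_adverbial_jaise(False))
```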
References

A Proposition Bank of Urdu
SNACS annotation of case markers and adpositions in Hindi
SNACS annotation of case markers and adpositions in Hindi
A preliminary work on Hindi causatives
Adapting predicate frames for Urdu PropBanking
Hindi PropBank annotation guidelines
Causation in Hindi-Urdu: Care for your instruments and subjects
Hindi-Urdu phrase structure annotation guidelines
The Structure of Complex Predicates in Urdu
Dative subjects
Differential case-marking in Competition and Variation in Natural Languages
The construction of excess and sufficiency from a crosslinguistic perspective
Hindi TimeBank: An ISO-TimeML annotated reference corpus
Encoding event structure in Urdu/Hindi VerbNet
Annotating Korean adposition semantics
Korean adposition and case supersenses v0
A note on possessive constructions in Hindi-Urdu
The World's Major Languages
Spatial Expressions and Case in South Asian Languages
Modern Hindi Grammar
The Indo-Aryan Languages
Experiencer Subjects in South Asian Languages
Argument structure in Hindi
Motion events and the lexicon: a case study of Hindi
Universal dependencies v2: An evergrowing multilingual treebank collection
Hindi syntax: Annotating dependency, lexical predicate-argument structure, and phrase structure
Draw mir a sheep: A supersense-based analysis of German case and adposition semantics
Licensing of instrumental case in Hindi/Urdu causatives. Nordlyd
Comprehensive supersense disambiguation of English prepositions and possessives
Adposition and case supersenses v2.5: Guidelines for English
Preparing SNACS for subjects and objects
Analysis of the Hindi Proposition Bank using dependency structure
Semantic roles for nominal predicates: Building a lexical resource