ASPER: Attention-based Approach to Extract Syntactic Patterns denoting Semantic Relations in Sentential Context
Md. Ahsanul Kabir, Typer Phillips, Xiao Luo, Mohammad Al Hasan
Date: 2021-04-04

Semantic relationships between a pair of entities in a sentence, such as hyponym-hypernym, cause-effect, and meronym-holonym, are usually reflected through syntactic patterns. Automatic extraction of such patterns benefits several downstream tasks, including entity extraction, ontology building, and question answering. Unfortunately, automatic extraction of such patterns has not yet received much attention from NLP and information retrieval researchers. In this work, we propose an attention-based supervised deep learning model, ASPER, which extracts syntactic patterns between entities exhibiting a given semantic relation in the sentential context. We validate the performance of ASPER on three distinct semantic relations, hyponym-hypernym, cause-effect, and meronym-holonym, over six datasets. Experimental results show that for all these semantic relations, ASPER can automatically identify a collection of syntactic patterns reflecting the existence of such a relation between a pair of entities in a sentence. In comparison to the existing methodologies of syntactic pattern extraction, ASPER's performance is substantially superior.

Syntactic patterns within a sentence capture various semantic relationships between the entities in the sentence. For instance, in a sentence like Sigmoid is a kind of activation function, the pattern X is a kind of Y establishes that sigmoid and activation function share a hyponym-hypernym relationship. Similarly, in a sentence like COVID-19 causes breathing difficulty in some patients, the cause-effect relation between COVID-19 and breathing difficulty is reflected by the pattern X causes Y. Linguists call such patterns syntactic patterns, as they use sentential structures to denote a certain relationship between the symbolic features X and Y. Extraction of syntactic patterns is an important natural language processing task, as such patterns can be used to extract entity pairs exhibiting various semantic relationships Patel et al. (2018); Volkova et al. (2010) and for question answering Jijkoun et al. (2004). Specifically, hyponym-hypernym patterns can be used for ontology building Klaussner and Zhekova (2011); Ghadfi et al. (2014). Cause-effect patterns can be used for extracting entities from medical text to discover relations between diseases, symptoms, and medication Patel et al. (2018); Ravikumar et al. (2017). In the existing literature, manual or semi-automatic approaches have been used for the extraction of syntactic patterns. The earliest among these works was Hearst's seminal contribution Hearst (1992, 1998) on finding patterns for the hyponym-hypernym relation through manual inspection. Similar manual approaches have also been used for extracting patterns denoting cause-effect Girju and Moldovan (2002) and meronym-holonym Berl and Charniak (2002) relations. But a manual approach to pattern extraction is laborious and time consuming. Besides, for every new semantic relationship, an independent inquiry needs to be pursued to obtain a collection of syntactic patterns encoding that relationship. Snow et al.
(2005) proposed one of the earliest semi-automatic syntactic pattern extraction methods. However, the method considers only one kind of semantic relationship, hyponym-hypernym. Also, from a methodological standpoint, it uses a raw frequency threshold over sentential structures in the corpus for selecting a pattern, which generally produces patterns of poor quality. Subsequent to Snow et al.'s work, another semi-automatic method was proposed by van Hage et al. (2006) for extracting meronym-holonym patterns. This method is also based on a frequency threshold, and the authors themselves reported that most of the extracted patterns are false positives. Though the extraction of syntactic patterns is not their focus, a number of works have utilized syntactic patterns for classifying whether a semantic relationship exists between a pair of entities Snow et al. (2005); Khoo et al. (1998); Sorgente et al. (2013); Sheena et al. (2016a); Phi and Matsumoto (2016). Note that the extraction of syntactic patterns is orthogonal to the task of relation classification; the former extracts syntactic patterns from sentences reflecting a semantic relationship, whereas the latter classifies whether a semantic relationship exists between a pair of entities. In this paper our focus is on the former task, the extraction of syntactic patterns.

Machine learning based methods have also been used for predicting the semantic relation between a pair of entities in a sentence. The majority of these works Baroni and Lenci (2011a); Necşulescu et al. (2015); Santus et al. (2015, 2016); Shwartz et al. (2016) consider the hyponym-hypernym relationship and solve a binary classification problem to identify whether such a relation holds between a given pair of entities. Such approaches are often designed to achieve high classification accuracy, but they are not capable of extracting syntactic patterns Yu et al. (2015); Sanchez and Riedel (2017); Nguyen et al. (2017). To summarize, automatic extraction of syntactic patterns for an arbitrary semantic relation remains an unsolved task.

Developing an automated method for extracting syntactic patterns for an arbitrary semantic relation is challenging. While humans can easily recognize syntactic patterns through a neuro-cognitive process that enables them to perceive a subject as a structured whole consisting of objects arranged in space or sequence, the same does not hold for a machine learning-based agent, which is better at statistical pattern recognition than syntactic pattern recognition. So it is no wonder that existing computational NLP and AI research has not ventured much into the automatic identification of syntactic patterns from natural language text. Nevertheless, this task is extremely important, because the performance of many NLP applications, such as pattern-based classification Sorgente et al. (2013); Hearst (1992); Sheena et al. (2016b); Berl and Charniak (2002), question answering McNamee et al. (2008); Jijkoun et al. (2004), and ontology induction Poon and Domingos (2010), will improve significantly through the automatic recognition of such syntactic patterns. In this paper, we propose ASPER, a generic attention-based deep learning model that can identify syntactic patterns for any semantic relationship.
ASPER follows a supervised learning approach: the model is trained on a collection of sentences; for each sentence, an ordered pair of entities is identified and a binary label is provided which denotes whether the entities are involved in a specific semantic relationship in that sentence. The output of the model is a collection of syntactic patterns which reflect the chosen semantic relationship between entities. By changing the training data, ASPER can, in theory, return syntactic patterns for any semantic relationship. To obtain the patterns of a given relationship, ASPER uses a bi-directional LSTM with an attention layer, which highlights the parts of the sentence (the pattern) that are important for deciding whether the identified pair of entities in the sentence is involved in that relationship. Importantly, in the data representation, ASPER does not use the embedding vectors of the entities whose relationship is being queried, which compels ASPER to answer the query by discovering syntactic patterns capturing that relationship. Experiments on multiple datasets show ASPER's effectiveness. We claim the following contributions:
• We propose ASPER, a novel deep learning model which can extract syntactic patterns of a chosen semantic relationship between entities in a sentence, effectively and efficiently.
• Experiments on multiple semantic relationships, such as hyponym-hypernym, meronym-holonym, and cause-effect, show that ASPER can identify most of the previously reported syntactic patterns of these relations. It can also identify a few patterns which have not been explicitly noted in earlier works.

In this section, we begin by formally defining the relationship-based syntactic pattern extraction task. We then describe the LSTM architecture of ASPER along with its input representation and loss function. Finally, we describe how ASPER extracts syntactic patterns and provide the pseudo-code of the end-to-end system.

Given a sentence S and a pair of entities (words or phrases) u, w in S exhibiting a specific semantic relationship R (e.g., hypernymy, co-hypernymy, meronymy, causality, etc.), the task of syntactic pattern extraction is to extract a syntactic pattern P which manifests that the entity pair (u, w) is related through the relation R. To extract such patterns, in this work we adopt a supervised learning model. As input, the model takes a set of triplets, T = {((u_i, w_i), S_i, y_i)} for i = 1, ..., Λ, where (u_i, w_i) is a directed pair of entities, S_i is a sentence in which the words u_i and w_i co-occur, and y_i is a binary label indicating whether the directed entity pair (u_i, w_i) exhibits the relationship R in the contextual scope of S_i; Λ is the number of distinct triples in T. The objective of the model is to extract all syntactic patterns P such that P is associated with one or multiple sentences in T, indicating that the entity pairs (u_i, w_i) in those sentences are related through the relation R. Note that a syntactic pattern is a sequence consisting of a subset of linguistic elements from the sentences in T, conveying the existence of the corresponding relationship in a human-understandable manner. For example, the sentence LSTM is a type of neural network exhibits a hyponym-hypernym relation between LSTM and neural network; the purpose of ASPER is to extract the syntactic pattern X is a type of Y from this sentence.
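To make the task input and output concrete, the following is a small illustrative example in Python of a single training triplet and the pattern we would expect ASPER to surface for it. The field names and data layout are ours for illustration only and are not taken from the authors' implementation.

```python
# A hypothetical training triplet ((u, w), S, y) for the hyponym-hypernym
# relation; the dictionary layout is illustrative, not the authors' format.
triplet = {
    "u": "LSTM",                                    # candidate hyponym
    "w": "neural network",                          # candidate hypernym
    "sentence": "LSTM is a type of neural network",
    "label": 1,                                     # 1: the pair is related in this sentence
}

# After training on many such triplets, the pattern set returned by ASPER
# should contain human-readable patterns such as:
expected_pattern = "X is a type of Y"               # X = hyponym, Y = hypernym
```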
To successfully extract a syntactic pattern which demonstrates the relationship R between the entity pair (u, w) in S, we must first determine whether u and w exhibit the relationship R. To make such distinctions, we train a binary classifier in a supervised manner on a set of training triples, T = {((u, w), S, y)}. Since our main objective is to extract syntactic patterns from sentences, a classification model that works with sequential data is needed. In addition, the model should be able to identify the parts of the sentence which contribute the most to the relationship prediction decision. For these reasons, we use a bi-directional Long Short-Term Memory (Bi-LSTM) Hochreiter and Schmidhuber (1997) augmented with an attention layer Bahdanau et al. (2015) as our binary classifier. The Bi-LSTM model is able to leverage the sequential nature of our sentence representation. Furthermore, as a result of supervised learning, the model's attention layer will be trained to highlight the parts of the sentence that are particularly useful in determining the presence of relationship R between the entity pair (u, w). We can therefore inspect the attention layer to identify the important sentential constructs, which can then be composed to generate the syntactic pattern P.

For a sentence, the Bi-LSTM model takes a vector-sequence representation of the sentence and outputs a prediction of the binary label. The complete model is shown in Fig. 1. As shown in the bottom layer of Fig. 1, the input to the Bi-LSTM is the vector sequence representation of a sentence S. This representation, denoted X, is a sequence of K edge embeddings, each of dimension D, where the K edges are obtained from the dependency tree of the input sentence. The vector representation of a sentence, composed of a sequence of edge embeddings, is discussed in detail in Section 1.3. The Bi-LSTM layer L takes x_i ∈ X as input and outputs two hidden state vectors: the first is the forward state output, the second is the backward state output, and the two are concatenated. Recall that the shape of a single sentence representation X is K × D, where K is the number of edges in the sentence representation from the dependency tree and D is the dimension of each edge representation. Therefore, when given a single sentence representation X, the Bi-LSTM layer L produces a concatenated output H of shape K × 2N_u, where N_u specifies the size of a single hidden vector. For our experiments, we set N_u to 256.

Following the Bi-LSTM layer L, the output H is used as input to the attention layer, Att. The attention layer produces A_t, a vector of size K × 1, where each a_i ∈ A_t is a value within a fixed range, a_i ∈ [0, 1]. Each attention value a_i encodes the relative importance of the edge embedding x_i in making the binary classification decision. A_t is computed by projecting H with a trainable matrix W_1 of shape 2N_u × 2N_u and then with a second trainable matrix W_2 of shape 2N_u × 1, which yields a temporary variable Temp of shape K; a Softmax activation is applied to Temp to obtain A_t. Next, the model uses both A_t and H as inputs to the repetition layer, Rep, which outputs R of shape K × 2N_u. R is simply the scalar multiplication of each hidden vector h_i ∈ H by its corresponding attention value a_i ∈ A_t. The model then uses R as input to the aggregation layer, Agg.
The aggregation layer computes the column-wise sum of R to yield the output A_g of shape 2N_u. In short, A_g is the weighted sum of H, where the weights are the attention values. A_g is then used as input to a fully-connected layer with a sigmoid activation function, whose output is a scalar ŷ denoting the prediction of the binary label y of a triplet t ∈ T. Here W_3, the weight matrix of this layer, is randomly initialized and has shape 2N_u × 1. Using these constructs, we train the binary classifier on the sentence representations generated from a collection of triplets T, using the standard binary cross-entropy loss, L = -(1/Λ) Σ_{i=1..Λ} [y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i)].

Algorithm 1: Pattern Extraction

To identify syntactic patterns from a sentence using machine learning, the sentence should be embedded in a form that preserves its syntactic structure. For this, we generate the dependency tree of a sentence Covington (2001) and use it as input to our learning model. The motivation is that the dependency tree of a sentence captures the syntactic structure of the sentence through a parse-tree-like structure (see Fig. 2). However, the dependency tree only provides a symbolic representation, so we obtain a vector representation of it to be used as input to our model. Given an N-word sentence S, we obtain a dependency tree of S by using a dependency parser (in our implementation we use spaCy 2.2.3 Honnibal and Montani (2020)). The output of this parser is an acyclic directed dependency tree, G = (V, E). An example of such a dependency tree is given in Fig. 2. As shown in this figure, each vertex v_i ∈ V is a tuple representing a word (or phrase) from S and the part-of-speech tag (e.g., noun, verb, adverb, adjective) of that word (|V| = N). The edge set E is the set of all directed edges in the dependency tree, with cardinality |E| = M (M < N). Each dependency edge e_ij links a parent vertex v_i to a child vertex v_j and is labeled by the type of syntactic dependency (attribute, coordinating conjunction, compound, etc.) between the words at the two end-vertices of the edge.

In general, the dependency tree G of a sentence may contain many vertices and edges which do not contribute to conveying whether the entity pair (u, w) shares the relationship R. For example, consider the sentence: The cat, a type of animal, enjoys laying around and eating. The first half of this sentence is critical in establishing that the cat is a type of animal. Clearly, though, enjoys laying around and eating plays no role in establishing a semantic relationship between cat and animal. In our sentence representation, we discard such vertices and their associated edges. Specifically, we preserve all vertices and edges which are along the shortest path connecting the word pair u and w. We also preserve the descendants of u and w, along with the edges which connect u and w to their descendants. Next, we organize the edges of the filtered tree into a fixed ordering, in which the edges on the shortest path between u and w come first, followed by the descendant edges of u and w. Fig. 2 shows the dependency tree of the sentence Like most mammals, dogs have body hair. All edges along the shortest path from mammals to dogs are part of ASPER's sentence representation. However, most, which is a descendant of mammals, is also part of the desired syntactic pattern, Like most Y, X; that is why the descendants of X and Y are also important. But neither all the words on the shortest path nor all the descendants are part of the pattern.
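Before turning to how the individual edges are embedded, the following is a minimal PyTorch sketch of the Bi-LSTM + attention classifier described in Section 1.2, which consumes the K × D edge-embedding sequence. It is our reconstruction rather than the authors' released code: the exact attention scoring equation is not legible in this copy, so the tanh nonlinearity between W_1 and W_2 is an assumption; only the reported sizes (N_u = 256, D = 1100) are taken from the paper.

```python
import torch
import torch.nn as nn

class ASPERClassifier(nn.Module):
    """Bi-LSTM + attention binary classifier over a K x D edge-embedding sequence.

    A reconstruction sketch of the model described in the text; the tanh used in
    the attention scoring is an assumption, not stated in the paper.
    """

    def __init__(self, edge_dim: int = 1100, hidden: int = 256):
        super().__init__()
        self.bilstm = nn.LSTM(edge_dim, hidden, batch_first=True, bidirectional=True)
        self.W1 = nn.Linear(2 * hidden, 2 * hidden, bias=False)  # 2*Nu x 2*Nu
        self.W2 = nn.Linear(2 * hidden, 1, bias=False)           # 2*Nu x 1
        self.W3 = nn.Linear(2 * hidden, 1)                       # final projection
        self.sigmoid = nn.Sigmoid()

    def forward(self, X: torch.Tensor):
        # X: (batch, K, D) -- one padded edge-embedding sequence per sentence.
        H, _ = self.bilstm(X)                        # (batch, K, 2*Nu) concatenated states
        scores = self.W2(torch.tanh(self.W1(H)))     # (batch, K, 1); Temp, tanh assumed
        A_t = torch.softmax(scores, dim=1)           # attention over the K edges
        R = H * A_t                                  # repetition layer: scale h_i by a_i
        A_g = R.sum(dim=1)                           # aggregation: attention-weighted sum
        y_hat = self.sigmoid(self.W3(A_g))           # (batch, 1) relation probability
        return y_hat.squeeze(-1), A_t.squeeze(-1)    # attention is reused for pattern mining
```

In training, applying `nn.BCELoss()` to `y_hat` and the 0/1 labels would implement the binary cross-entropy objective described above, and the returned per-edge attention scores are what the pattern-extraction step consumes.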
This is why ASPER uses an attention-based approach: so that the words and edges that are important for the pattern can be extracted. To generate the representation of a sentence, we embed each of the selected edges in sorted order and compose the resulting ordered edge representations x_k into a single vector sequence representation X of S. The embedding x_k of an edge e_k = (v_i, v_j) is composed of the following: (1) the semantic embedding of the word (or phrase) at the parent vertex v_i, (2) the one-hot encoding of the part-of-speech tag of the word (or phrase) at the parent vertex v_i, (3) the one-hot encoding of the syntactic dependency between v_i and v_j, and (4) the semantic embedding of the word (or phrase) at the child vertex v_j. Zero vectors of appropriate dimension are used as the semantic embeddings of the two entities u and w. This forces ASPER to use only the syntactic structural information of the sentence for predicting the relation between u and w, ignoring semantic information about the entity pair. For all words other than u and w, we use 512-dimensional Universal Sentence Encoder (USE) vectors Cer et al. (2018). Note that one may use other choices, such as word2vec or GloVe, instead of USE. For the part-of-speech tags and syntactic dependency types, we use one-hot encodings of 18 and 58 dimensions, respectively. Therefore, every edge embedding has a fixed dimension D = 512 + 512 + 18 + 58 = 1100. Finally, we fix the vector sequence X to a fixed length K (the number of edges) by either removing edge embeddings from the end of the sequence or adding zero-padding vectors of size D. This ensures that any sentence representation X is of a fixed size, K × D. Clearly, our sentence representation X is agnostic to the relationship R, so it is capable of encoding an arbitrary semantic relationship between a given entity pair u and w.

After training the supervised learning model as discussed in Section 1.2, the model can be used for classifying whether an unseen pair of entities (within the context of a sentence) shares a relationship or not. This works for an arbitrary semantic relationship as long as we can gather training data for that relationship. However, our main aim is the automatic extraction of syntactic patterns, so we treat each edge in an edge-set as an item and apply the frequent itemset mining algorithm ECLAT Zaki (2000) to obtain frequent edge-sets over the sentences of T, which constitute the desired syntactic patterns. The pseudo-code of ASPER is given in Algorithm 1. For the triplets t = ((u, w), S) in a given collection T, we first train the model and predict the label ŷ and the attention values A_t associated with the edges of each t (Lines 3-6). Next, we consider each t ∈ T for which the model predicts positively, i.e., confirms the existence of the relationship R between the entity pair (u, w) in the sentence S (Line 7). For the qualified triples, we normalize the attention values of their edges and, using an importance threshold att (a value between 0 and 1), filter out the edges of lesser relative importance (Line 10). Since the attention values are on an exponential scale (the output of a Softmax function), before applying the threshold we take the logarithm of the attention values and then use min-max normalization to scale them between 0 and 1 (Lines 8-9). For each qualified triple, we accumulate an edge-set consisting only of the important edges (Line 11).
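Below is a small Python sketch of the post-processing steps just described: log-scaling and min-max normalization of the attention values, thresholding with att, and mining frequent edge-sets. The paper uses ECLAT for the mining step; the naive subset-counting function here is only a stand-in for it, and all names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def important_edges(edges, attention, att_threshold=0.6):
    """Log-scale and min-max normalize softmax attention values, then keep
    only the edges whose normalized score reaches att_threshold (Lines 8-10)."""
    logs = [math.log(a + 1e-12) for a in attention]
    lo, hi = min(logs), max(logs)
    norm = [(v - lo) / (hi - lo + 1e-12) for v in logs]
    return frozenset(e for e, a in zip(edges, norm) if a >= att_threshold)

def frequent_edge_sets(edge_sets, min_support=0.003):
    """Naive stand-in for ECLAT: enumerate subsets of each (small) important
    edge-set and keep those whose relative frequency reaches min_support."""
    counts = Counter()
    for es in edge_sets:
        for r in range(2, len(es) + 1):
            for subset in combinations(sorted(es), r):
                counts[subset] += 1
    n = max(len(edge_sets), 1)
    return {s for s, c in counts.items() if c / n >= min_support}

# edge_sets would be accumulated only over triples the classifier labels as
# positive (Line 7 of Algorithm 1); each edge can be any hashable identifier,
# e.g. a (head_word, dependency_label, child_word) tuple.
```

The defaults att_threshold = 0.6 and min_support = 0.003 (0.3%) mirror the ballpark of the values reported later in the experiments; the confidence (conf) filter is omitted from this sketch.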
Then, a frequent pattern mining algorithm is applied to obtain the syntactic pattern set (Line 14).

As ASPER is relation-agnostic, we validate its performance in extracting syntactic patterns for multiple relations; specifically, we choose the hyponym-hypernym, cause-effect, and meronym-holonym relationships, as these three are well-studied semantic relations in the literature. We also compare the performance of ASPER with Snow's method Snow et al. (2005), the only semi-automatic method (to the best of our knowledge) that extracts syntactic patterns. However, Snow's method works only for the hyponym-hypernym relation, so we compare against it only for this relation. For the other relations that we experimented with, we are not aware of any method apart from manual ones Girju and Moldovan (2002); Berl and Charniak (2002), so for these relations we show results for ASPER only.

We use six datasets for evaluating the performance of ASPER; their statistics are shown in Table 1. Among these, LEX and RND are used for hyponym-hypernym pattern extraction; the SemEval and ADE datasets are used for cause-effect pattern extraction; and the remaining two, Bless and Phi, are used for meronym-holonym pattern extraction. Our problem formulation requires context sentences for the entity pairs, but four of the six datasets do not have any context sentence associated with the entity pairs. For these, we obtain context sentences from Wikipedia: we download the latest Wikipedia dump, extract all sentences, and, if a pair of entities co-occurs in a sentence, associate that sentence with the entity pair. Note that, in this way, a given pair can be associated with multiple sentences. It is also important to understand that not every sentence contains a pattern even if the sentence contains an entity pair; on some occasions, sentences merely list a pair of entities but do not imply a relationship between them in the sentential context. The pattern extraction methods (ASPER and Snow's method) extract patterns only if the model predicts that a relation between a pair of entities exists in the sentential context. More details of these datasets are provided below.

LEX & RND: These datasets are obtained from Shwartz et al. (2016). They list a set of entity pairs with a label denoting whether the pair has a hyponym-hypernym relation (positive) or not (negative), without context sentences. As discussed above, we use Wikipedia to obtain context sentences for the entity pairs. Since multiple Wikipedia sentences can be associated with a given entity pair, for both datasets we allow at most five sentences per entity pair. Both LEX and RND are balanced, having the same number of positive and negative sentences. These datasets also come with predefined train, test, and validation partitions, which we respect. In the LEX dataset, disjoint entity pairs are used in the train and test partitions, while RND is split randomly, so the same entity pair may appear in the train, validation, and test partitions, but with distinct sentences.

Bless: We use this dataset for evaluating meronym-holonym pattern extraction. It was used in Baroni and Lenci (2011b) for classifying different semantic relationships. It does not have any context sentences, so we extract sentences from Wikipedia for these pairs. Since this dataset has entity pairs for many relations, we treat the meronym-holonym entity pairs as the positive class and the others as the negative class.
For both positive and negative entity pairs, we allow at most three sentences per pair. Finally, we maintain a positive-to-negative sentence ratio of 1:1 and split the dataset into train, test, and validation partitions of 50%, 40%, and 10%, respectively.

Phi: This dataset comes from Phi and Matsumoto (2016), in which the authors (Phi et al., after whom we name the dataset) used word embeddings for extracting different kinds of meronym-holonym relationships between entities. We use this dataset for evaluating meronym-holonym pattern extraction. It contains only positive pairs, labeled with different kinds of part-whole relationships such as component-of (11.2%), member-of (22.21%), stuff-of (18.89%), and participates-in (15.23%). For the negative sentence instances, we borrow from the Bless dataset, and we maintain a positive-to-negative sentence ratio of 1:1 so that the dataset is balanced. Finally, we split the dataset randomly into train, test, and validation partitions containing 50%, 40%, and 10% of the instances, respectively.

SemEval: This is a widely used dataset, built by combining the SemEval 2007 Task 4 dataset Girju et al. (2007) and the SemEval 2010 Task 8 dataset Hendrickx et al. (2010). The SemEval datasets provide predefined positive and negative sentences with corresponding entity pairs, as well as predefined train and test partitions; for the validation partition, we borrow from the train partition. Unlike the previous datasets, this dataset is imbalanced: the ratio of positive to negative sentences is 1:5, and the ratio of train, test, and validation sentences is 7:3:1.

ADE: In order to have entity pairs for each negative sentence, we randomly obtain two noun phrases from each negative sentence. This dataset is balanced in terms of the number of sentences, and the train, test, and validation partitions contain 50%, 40%, and 10% of the data, respectively.

For training ASPER, we first use the Bi-LSTM model to identify the important dependency tree edges which later constitute the syntactic patterns. This step involves a few user-defined parameters: K (the maximum number of dependency tree edges in the sentence representation), batchSize (the number of training instances in a batch), the hidden layer size (N_u) of a Bi-LSTM unit, and the learning rate. We fix the hidden layer size at 256, without tuning, and use the Adam optimizer. For extracting patterns from the important dependency edges by frequent itemset mining, we have two hyper-parameters: supp (the minimum support threshold, in percent) and conf (the minimum confidence threshold, in percent). Another hyper-parameter is att (the attention threshold), which is used to filter the important edges. att is tuned over values from 0.1 to 0.9 in steps of 0.1; we get good patterns for att = 0.6 on all datasets except Phi, where att = 0.1 works well. Both supp and conf are tuned on a validation set over values from 0.1% to 3.0% in steps of 0.1; the patterns obtained on the validation set are manually scanned to choose the optimal values of supp and conf. For very small values of these parameters, we find noisy and incomplete patterns, which do not qualify as syntactic patterns of a relation; conversely, if the values are too large, we find too few patterns. Overall, small support and confidence thresholds work best, as they yield larger patterns that denote a full syntactic pattern conveying a semantic relationship. For LEX and RND, the optimal supp values are 0.28% and 1.3%, and the optimal conf values are 0.7% and 0.7%.
For the SemEval (combined) and ADE datasets, the optimal supp values are 0.3% and 0.5%, and the optimal conf values are 0.5% and 0.4%. Finally, for the Bless and Phi datasets the optimal supp values are 0.3% and 1.0%, and the optimal conf values are 0.4% and 0.5%, respectively. We perform an ablation study over supp (results are shown in Section 2.6).

Evaluating pattern extraction is a difficult task, as ground truth for a pattern extraction method is not available. Existing works, whether manual or semi-automatic, only perform a qualitative evaluation. In this work we propose two quantitative metrics for evaluating the performance of pattern extraction, which we discuss below.

Our first evaluation method builds ground truth by manually extracting patterns directly from the sentences in a dataset. Unfortunately, such an effort is time consuming and difficult for large datasets, so this kind of evaluation is only possible on a sample of the sentences in a dataset. Given a potentially large test dataset, we first choose a random subset of sentences (around 1000) from the positive class (where the entity pair in the sentence exhibits the relation). For each of these sentences, we manually extract the pattern, which yields a ground-truth pattern set over a sample of the dataset. If P_t is the total (ground-truth) pattern set and P_o is the pattern set obtained by a method over the same sampled sentences, the precision and recall of pattern extraction by the method are defined as precision = |P_t ∩ P_o| / |P_o| and recall = |P_t ∩ P_o| / |P_t|.

A problem with the previous evaluation metric is that it is computed over a random sample of sentences in the dataset, not the entire dataset; in fact, it is impractical to extract patterns manually over all the sentences in a dataset. But for any semantic relation there generally exists a finite number of patterns, and it is easier to validate these patterns without observing them in the sentential context. In our second evaluation method, we therefore manually evaluate the precision of the patterns extracted by a method over the entire dataset without evaluating them in the sentences. In other words, all the correctly predicted patterns in an extracted pattern set are considered to be the ground truth, and precision is computed as the ratio of correctly predicted patterns to all extracted patterns. If we have more than one pattern extraction method, we collect all the correctly predicted patterns of all the methods, consider their union to be the ground-truth pattern set, and report precision on the basis of this set. Evaluation at the pattern level is easier because the number of patterns is generally fewer than a hundred for a given semantic relation, and a pattern can still be evaluated manually without considering it in the sentential context. However, note that in this kind of evaluation a method is not penalized for failing to discover a pattern as long as no other competing method is able to discover that pattern.

In this section we first discuss the performance of ASPER in extracting patterns for three distinct relationships, hyponym-hypernym, cause-effect, and meronym-holonym, over six datasets, two for each relation. In Table 2, we show the precision, recall, and F1 results for all datasets using the sentence-based evaluation. Over all datasets and relations, ASPER's performance is best for detecting patterns of the hyponym-hypernym relation, with F1 scores of 0.74 and 0.88 on the LEX and RND datasets, respectively. The poorest performance is for the meronym-holonym patterns, with F1 scores of 0.54 and 0.64 on its two datasets.
The best performance on the hyponym-hypernym relation is possibly due to the well-established patterns for expressing this relation in a sentence. For the other two relations, the syntactic patterns are more fluid and hence harder for an automated method to recognize; that is, even if a pair holds a semantic relation, only a few sentences contain a syntactic pattern. We observe this in the manually labelled sample of the test dataset, and a similar argument holds for the ADE cause-effect dataset. The performance on the SemEval dataset is comparatively better: this dataset was created for a competition, and many of its sentences are constructed with true cause-effect patterns. Finally, although the Phi dataset already comes with sentences, those sentences do not always contain consistent syntactic patterns. In Table 3 we show the results using the pattern-based evaluation; the findings are very similar to the results in Table 2. Note that for the pattern-based evaluation metric, only precision is shown. This is because, when only one extraction method is used, pattern-based evaluation gives no knowledge of false negatives, so recall cannot be computed.

We compare the performance of ASPER with that of Snow et al. (2005); the latter works only for hyponym-hypernym pattern extraction, so we show comparison results on the LEX and RND datasets for the hyponym-hypernym pattern extraction task. The comparison is shown in the bar charts of Figure 4 using the precision, recall, and F1 values of both pattern evaluation metrics. Both methods are tuned for the highest F1 score. As we can see from the bar charts, for both datasets (LEX on the left, RND on the right) and with respect to both evaluation metrics, ASPER beats Snow's method significantly. In fact, the precision, recall, and F1 of Snow's method are substantially lower (about 50% lower) than those of ASPER for both evaluation metrics on both datasets. Although we could not compare ASPER with other methods for meronym-holonym and cause-effect pattern extraction, the results in Tables 2 and 3 clearly indicate that ASPER also performs well for extracting patterns of these relations.

In Figure 3, we show some of the patterns extracted by ASPER. The first column shows the human-readable patterns, the second column shows the entity pairs which exhibit the semantic relationship, and the third column shows the corresponding example sentences. For instance, Aasu is a hyponym of the hypernym village. For this hyponym-hypernym pair, the sentence A hiking trail leads to the village of Aasu is extracted from Wikipedia, from which ASPER identifies the dependency edges village → the, village → of, and of → Aasu from the attention values and itemset mining. If we replace Aasu with X and village with Y, we get the hyponym-hypernym pattern the Y of X in the first column, which is not reported by Hearst (1992) or Snow et al. (2005). Along with finding new hyponym-hypernym patterns, ASPER re-discovers most of the Hearst patterns. Similarly, in the third row of the cause-effect patterns in Figure 3, Y generated by X is a pattern which is not reported by Khoo et al. (1998), and in the fourth row, Y influenced by X is not used by Sorgente et al. (2013) for classification. For meronym-holonym, Y element for X is not used by Sheena et al. (2016a).

The main hyper-parameters of ASPER which affect its pattern extraction performance are supp and conf.
We show ASPER's performance over varying supp values, keeping the conf value fixed, for one dataset of each relation. The trend with varying conf is similar to that of supp and hence is not shown. The findings are shown in Figure 5. In each plot of this figure, support values are shown along the x-axis and performance values (precision, recall, F1) are shown along the y-axis. In all three plots, the F1 score increases as supp increases, reaches a peak, and then gradually decreases. With larger supp, precision always increases, as a higher support imposes a more stringent requirement for the selection of a pattern. On the other hand, the recall curves always trend downward, since the number of predicted patterns decreases as supp increases. For comparison, in the earlier semi-automatic work on part-whole pattern extraction discussed above, the authors reported finding 1000 snippets and 4503 unique patterns for 503 part-whole pairs; the top 300 most frequent of the 4503 patterns were manually validated, and only 12 were found to be correct.

We present ASPER, a novel deep learning model which can extract syntactic patterns shared between entity pairs within a sentential context to convey a semantic relation. It works for any relation; it can predict the existence of a relation, and it can also extract the syntactic patterns of that relation, a unique feature that no existing method can offer. The experimental results show that ASPER can extract all known syntactic patterns of a relation, including a few new patterns which are not explicitly stated in previous works. Future work includes applying ASPER to extract syntactic patterns of other semantic relationships. Another research goal is to use the patterns, specifically the cause-effect patterns, to extract entity pairs showing relationships between diseases, symptoms, and medication. The authors are committed to reproducible research and will release the code and ground-truth datasets once the paper is accepted.

Figure 3: extracted pattern(s) | entity pair | example sentence. Hyponym-hypernym examples:
X, a class of Y | (Core 2 Duo, microprocessor) | Core 2 Duo, a class of early Desktop micro-processor, had much lower core frequency and approximately the same FSB frequency and level 2 cache size as Pentium D microprocessors.
A class of Y, X | (Core 2 Duo, microprocessor) | A class of early Desktop micro-processor, Core 2 Duo, had much lower core frequency and approximately the same FSB frequency and level 2 cache size as Pentium D microprocessors.
X be a class of Y | (Core 2 Duo, microprocessor) | Core 2 Duo is a class of early Desktop micro-processor which had much lower core frequency and approximately the same FSB frequency and level 2 cache size as Pentium D microprocessors.
X, a family of Y; A family of Y, X; X be a family of Y | (Vinyasa, yoga) | Vinyasa, a family of yoga, is dynamic and ever-flowing.
X, a type of Y; A type of Y, X; X be a type of Y | (system software, computer software) | The system software, a type of computer software, is designed for running the computer hardware parts and the application programs.
X, a kind of Y; A kind of Y, X; X be a kind of Y | (panda, bear) | Panda, a kind of bear, is found only in China.
Y, including X; Y which/that include X; Y include X | (Asiatic black bear, bear) | Some species of bears, including Asiatic black bears and sun bears, are also threatened by the illegal wildlife trade.
Y, such as X; Y, for example X; Y, like X | (sheep, domesticated animal) | Domesticated animals, such as sheep or rabbits, may have agricultural uses for meat, hides and wool.
like many Y, X; the Y of X; Y as X | (Emperor, band) | Since the 1990s, Norway's export of black metal, a lo-fi, dark and raw form of heavy metal, has been developed by such bands as Emperor, Darkthrone, Gorgoroth, Mayhem, Burzum and Immortal.
Y "X" | (Clarens, village) | A commission was appointed in 1912 to finalize negotiations, and a decision was made to name the village "Clarens" in honour of President Paul Kruger's influence in the area.

Cause-effect examples:
… | … | We describe a case of interstitial hypoxaemiant pneumonitis probably related to flecainide in a patient with the LEOPARD syndrome, a rare congenital disorder.
X result in Y; Y be result of X | (flucloxacillin, fatal hepatic injury) | It is well-recognized that flucloxacillin may occasionally result in fatal hepatic injury.
Y from X | (exertion, satisfaction) | I have always drawn satisfaction from exertion, straining my muscles to their limits.
Y be triggered by X; X trigger Y | (earthquake, tsunami) | A large tsunami is triggered by the earthquake spread outward from off the Sumatran coast.
Y come from X | (fear, blockage) | Sometimes the blockage comes from fear, as for a CEO who hates public speaking but must give frequent speeches.
Y be the effect of X; Y, the effect of X | (acupuncture, pain relief) | Pain relief is the effect of acupuncture which lasts for an extended period of time, sometimes months after the needle was removed.
X produce Y; Y produced by X | (ambient vanadium pentoxide dust, irritation) | Ambient vanadium pentoxide dust produces irritation of the eyes, nose and throat.
X promote Y | (antiwar demonstrators, positive values) | He created and advocated flower power, "a strategy in which antiwar demonstrators promoted positive values like peace and love to dramatize their opposition to the destruction and death caused by the war in Vietnam."
X generate Y; Y generated by X | (tunable laser, optical signal) | The optical signal is generated by a tunable laser.
X influence Y; Y influenced by X | (tumorigenicity of clones, immunoprotective effects) | The tumorigenicity of clones may be influenced by immunoprotective effects.
Y due to X; Y because of X | (incorrect design, failure in physical containment) | Failures in physical containment may occur due to incorrect design.

References:
Neural machine translation by jointly learning to align and translate.
How we BLESSed distributional semantic evaluation.
How we blessed distributional semantic evaluation.
Finding parts in very large corpora.
Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping.
Universal sentence encoder for English.
A fundamental algorithm for dependency parsing.
Building ontologies from textual resources: A pattern based improvement using deep linguistic information.
Text mining for causal relations.
SemEval-2007 task 04: Classification of semantic relations between nominals.
Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports.
Automatic acquisition of hyponyms from large text corpora.
Automated discovery of wordnet relations.
SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals.
Long short-term memory.
spaCy 2.2.3: Industrial-strength natural language processing (2020).
Distant supervision for relation extraction with sentence-level attention and entity descriptions.
Information extraction for question answering: Improving recall through syntactic patterns.
Automatic extraction of cause-effect information from newspaper text without knowledge-based inferencing.
Lexico-syntactic patterns for automatic ontology building.
Semantic relation classification via bidirectional LSTM networks with entity-aware attention using latent entity typing.
Knowledge-oriented convolutional neural network for causal relation extraction from natural language texts.
Learning named entity hyponyms for question answering.
Reading between the lines: Overcoming data sparsity for accurate classification of lexical relationships.
Hierarchical embeddings for hypernymy detection and directionality.
Syntactic patterns improve information extraction for medical search.
Integrating word embedding offsets into the espresso system for part-whole relation extraction.
Unsupervised ontology induction from text.
BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences.
How well can we predict hypernyms from word embeddings? A dataset-centric analysis.
Nine features in a random forest to learn taxonomical semantic relations.
EVALution 1.0: an evolving semantic dataset for training and evaluation of distributional semantic models.
Automatic extraction of hypernym & meronym relations in English sentences using dependency parser.
Automatic extraction of hypernym & meronym relations in English sentences using dependency parser.
Attention-based convolutional neural network for semantic relation extraction.
Simple BERT models for relation extraction and semantic role labeling.
Improving hypernymy detection with an integrated path-based and distributional method.
Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection.
Learning syntactic patterns for automatic hypernym discovery.
Automatic extraction of cause-effect relations in natural language text.
A method for learning part-whole relations.
Boosting biomedical entity extraction by using syntactic patterns for semantic relation discovery.
Relation classification via multi-level attention CNNs.
Enriching pre-trained language model with entity information for relation classification.
Connecting language and knowledge with heterogeneous representations for neural relation extraction.
Learning term embeddings for hypernymy identification.
Scalable algorithms for association mining.
Graph convolution over pruned dependency trees improves relation extraction.