key: cord-0197198-ppz53hr6 authors: Opitz, Juri; Meier, Philipp; Frank, Anette title: SMARAGD: Synthesized sMatch for Accurate and Rapid AMR Graph Distance date: 2022-03-24 journal: nan DOI: nan sha: f6cf0ecb4cf3d20fa37f0dea04202da15681bfa9 doc_id: 197198 cord_uid: ppz53hr6 The semantic similarity of graph-based meaning representations, such as Abstract Meaning Representation (AMR), is typically assessed using graph matching algorithms, such as SMATCH (Cai and Knight, 2013). However, SMATCH suffers from NP-completeness, making its large-scale application, e.g., for AMR clustering or semantic search, infeasible. To mitigate this issue, we propose SMARAGD (Synthesized sMatch for accurate and rapid AMR graph distance). We show the potential of neural networks to approximate the SMATCH scores and graph alignments, i) in linear time using a machine translation framework to predict the alignments, or ii) in constant time using a Siamese CNN to directly predict SMATCH scores. We show that the approximation error can be substantially reduced by applying data augmentation and AMR graph anonymization. Abstract Meaning Representation (AMR) (Banarescu et al., 2013) represents the meaning of sentences in a graph format. The directed, rooted and acyclic AMR graph indicates the events and entities of a sentence, showing semantic roles and other key relations such as cause, time, purpose, instrument, etc. Because of these properties, AMR graphs are potentially useful for many natural language processing tasks: E.g., AMR has been used for summarization (Liu et al., 2015; Dohare et al., 2017; Liao et al., 2018) and Machine Translation . Often, pairs of AMRs need to be studied, using AMR metrics. Classically, AMRs are compared to assess Inter Annotator Agreement in Sem-Banking or for the purpose of parser evaluation, e.g., using the structural SMATCH metric . Going beyond these applications, researchers have leveraged SMATCH-based AMR metrics for NLG evaluation Manning and Schneider, 2021) , as a basis for a COVID-19 semantics-based search engine (Bonial et al., 2020) , comparison of cross-lingual AMR (Uhrig et al., 2021) , and fine-grained argument similarity assessment (Opitz et al., 2021b) . Many of these extended scenarios, such as AMR-based corpus search (Bonial et al., 2020) , greatly profit from a quick similarity computation. But AMR graph metrics typically suffer from a great time-complexity: Computation of SMATCH is NP-hard (Nagarajan and Sviridenko, 2009) , and it can take more than a minute to compare some 1,000 AMR pairs . To understand that this can become problematic in many setups, consider a hypothetical user who desires exploring a (small) AMR-parsed corpus with only n = 1, 000 instances via clustering. The (symmetric) Smatch metric needs to be executed over (n 2 − n)/2 = 499, 500 pairs, resulting in a total time of more than 6 hours. Given recent interest into meaning representations that cover multiple sentences, such as multi-sentence AMR (O'Gorman et al., 2018) , dialogue AMR (Bonial et al., 2021) or discourse representation structures (Kamp, 1981) , where the alignment search space is much larger (van Noord et al., 2018) , we can anticipate that this problem will aggravate in the future. Testing ways to mitigate these issues, we propose a method that learns to match AMR graphs from a teacher SMATCH, thereby reducing AMR clustering time from hours to seconds. Our contributions are: 1. We explore three different neural approaches to synthesize the canonical AMR metric SMATCH from scratch. 2. We show that we can approximate SMATCH up to a small error, by leveraging novel data augmentation tricks. Our code will be publicly released. Anchiêta et al., 2019) . ii) Wasserstein-alignment based and Weisfeiler-Leman graph metrics that aim to reflect human similarity ratings (Opitz et al., 2021a) . Note that show that skipping the alignment step risks safety hazards in parsing evaluation, underlining the importance of graph alignment. (Lee et al., 2018) . The 'long-range arena' benchmark (Tay et al., 2021) for long-sequence models also includes synthesizing tasks, e.g., 'listOps' (learning to calculate) and Xpath (tracing a squiggly line), which prove challenging even for SOTA architectures. Since structural graph matching with SMATCH is situated at the limit of algorithm complexity, learning it may prove an interesting additional challenge. The SMATCH metric measures the structural overlap of two AMRs. We i) compute an alignment between variable nodes of AMRs and ii) assess triple matches based on the provided alignment. Formally, we start with two AMR graphs G and G with variable nodes X = (x 1 , ...x n ) and Y = (y 1 ...y m ). The goal is then to find an optimal alignment: searching for a map that maximizes the number of triple matches for the two graphs. E.g., assume (a, ARG0, b) ∈ G and (c, ARG0, d) ∈ G . If a = c and b = d, we count one triple match. Finally: Researchers typically use a harmonic mean based overlap score = F 1 = 2P R/(P + R), where P = (triples(G)∩triples(G ))/triples(G) and R = (triples(G) ∩ triples(G ))/triples(G ). Experimental data creation We create training, development and testing data as follows: 1. We parse 59,255 sentences of the LDC2020T02 data with a parser (Lyu and Titov, 2018) to obtain graph pairs that can be aligned; 2. For every parallel graph pair (G, G ), we use SMATCH (ORACLE) to compute an F1 score s, and the alignment a. Thus, We shuffle the data and split it into training, development and test set (54255-2500-2500). The task is to reproduce the teacher ORACLE as precisely as possible. We design and test three different approaches. The first is indirect, in that it predicts the alignment, from which we compute the score. The second directly predicts the scores. The third approach enhances the second, to make it even more efficient. Here, we aim to synthesize the alignment itself (Eq. 1) into an NMT model, as illustrated in Figure 1 . The input to the model is as follows: we linearize the two AMRs and concatenate the linearized token sequences with a special token. The output consists of a sequence x j :y k ... x i :y m ... where in every pair u:v, u is a variable node from the first AMR, which is mapped to a node v from the second AMR. The SMATCH score is then calculated based on the predicted alignment. To predict the node alignments/mapping of variables, we use a transformer based encoder-decoder NMT model. Details about the network structure and hyperparameters are stated in Appendix A.1. In this setup, we aim to predict SMATCH F1 scores for pairs of AMRs directly, in a single step. This means that we directly synthesize Eq. 2 into a neural network and our target is the ORACLE F1 score. To facilitate this mapping, we adapt the convolutional neural network (CNN) of Opitz (2020) , as shown in Figure 2 . The model was originally intended to assess AMR accuracy (Opitz and Frank, 2019), i.e., measuring AMR parse quality without a reference. Taking inspiration from human annotators, who exploit a spatial 'Penman' arrangement of AMR graphs for better understanding, it models directed-acyclic and rooted graphs as 2d structures, employing a CNN for processing. To feed a pair of AMRs, we remove the dependency encoder of the model and replace it with an AMR graph encoder. Moreover, we increase the depth of the network by adding one additional MLP layer after convolutional encoding. A standard mean squared error is employed as loss function. More details about hyperparameters are stated in Appendix A.2. (2019), we aim to make the CNN even more efficient, by alleviating the need for pair-wise model inferences. Hence, instead of computing a shared representation of two CNN-encoded AMRs, we process each representation with an MLP (w/ shared parameters), to obtain two AMR vectors N N (G) and N N (G ). These vectors are then tuned with a distance loss L against ORACLE s: This approach enables extremely fast search and clustering: the required (clustering-)model inferences are O(n) instead of O(n 2 ), since the similarity is achieved with simple linear vector algebra. Vocabulary reduction trick We observe that SMATCH is a metric that measures the structural overlap of two graphs. This means that we can greatly reduce our vocabulary, by assigning each graph pair a local vocabulary (see Figure 3 , 'anonymize'). First, we gather all nodes from AMR pairs a and b, computing a joint vocabulary over the concept nodes. We then relabel the concepts with integers starting from 1. E.g., consider AMR a: (r / run-01 :ARG0 (d / duck)), and AMR b: (x / run-01 :ARG0 (y / duck) :mod (z / fast)). The best alignment is map = {(r, x), (d, y), (∅, z)}. Now, we set the shared concepts and relations to the same index run=run=1 and duck=duck=2 and :ARG0=:ARG0=3 and distribute the rest of the indices r=4, d=5, x=6, y=7, z=8, fast=9, :mod=10. This yields equivalent AMRs a = (4 / 1 :3 (5 / 2)) and b = (6 / 1 :3 (7 / 2) :10 (8 / 9)). The target alignment then equals map = {(4, 6), (5, 7), (∅, 8)}. This strategy greatly reduces the vocabulary size, in our case from 40k tokens to less than 700. Auxiliary data creation trick We also find that we can cheaply create auxiliary gold data. We re-assign different indices to AMR tokens, and correspondingly modify the ORACLE alignment (Figure 3 , 'permute'). In our experiments, we permute the existing token-index vocabularies 10 times, resulting in a ten-fold increase of the training data. We expect that, with this strategy, the model will better learn properties of permutation invariance, which in turn will help it synthesize the algorithm. Output post-processing In the implicitly synthesized alignment algorithm (score & vector synthesis), no further post-processing is required, since we directly obtain the estimated SMATCH scores as output. In the explicitly synthesized alignment algorithm, however, we get map, which is the predicted alignment from the sequence-to-sequence model. In this case, we simply feed map as an argument into Eq. 2, to obtain the scores. Evaluation We compare the predicted scoresŷ against the gold scores y with Pearson's ρ. However, for the model that predicts the explicit alignment (synthesizer I), we can compute another interesting and meaningful metric. For this, we first calculate the average SMATCH score over AMR pairs given the gold alignment map , and then we calculate the average SMATCH score over AMR pairs given the predicted alignment map. Note, that the SMATCH score based on the gold alignment constitutes an upper bound (max). Therefore, the SMATCH score based on the predicted alignment shows us how close we are to this upper bound. Our baseline consists of scores that are computed from a random alignment (random). Consistently, the data extension (aug) is very useful. On the other hand, the vocabulary reduction (voc) is only useful for the NMT model (+27.2 points), whereas the scores are lowered for the CNN-based models (−5.5 for score synthesis, −9.1 for vector synthesis). We conjecture that the CNNs learn SMATCH more indirectly by exploiting token similarities in the global vocabulary, and therefore struggle more to understand the algorithm itself, in contrast to the heavily parameterized NMT transformer that can learn to assess tokens fully from their given graph context. We proposed methods to synthesize the NP-hard AMR alignment algorithm SMATCH, exploring different neural architectures, and data augmentation strategies that help all models to generalize. Our best models increase SMATCH calculation speed by a large factor while incurring only small losses in accuracy that can be tolerated in many use cases. A.1 Sequence-to-sequence network parameters Hyper-parameters for the NMT approach are displayed in Table 2 . The best model is determined on the development data by calculating BLEU against the reference alignments. Hyper-parameters for the CNN approach are displayed in Table 2 . The best model is determined on the development data by calculating Pearson's ρ correlation of predicted scores and gold scores. Sema: an extended semantic evaluation for amr Program synthesis with large language models Abstract meaning representation for sembanking Builder, we have done it: Evaluating & extending dialogue-amr nlu pipeline for two collaborative domains InfoForager: Leveraging semantic search with AMR for COVID-19 research Smatch: an evaluation metric for semantic feature structures Execution-guided neural program synthesis Text summarization using abstract meaning representation A theory of truth and semantic representation. Formal semantics-the essential readings Deep learning for symbolic mathematics Deep-neural-network-based sinogram synthesis for sparse-view ct image reconstruction Abstract Meaning Representation for multi-document summarization Toward abstractive summarization using semantic representations AMR parsing as graph prediction with latent alignment Referenceless parsing-based evaluation of AMR-to-English generation On the maximum quadratic assignment problem AMR beyond the sentence: the multi-sentence AMR corpus AMR quality rating with a lightweight CNN Weisfeiler-Leman in the Bamboo: Novel AMR Graph Metrics and a Benchmark for AMR Graph Similarity Automatic accuracy prediction for AMR parsing Towards a decomposable metric for explainable evaluation of text generation from AMR Explainable unsupervised argument similarity rating with Abstract Meaning Representation and conclusion generation Amr similarity metrics from principles Neuro-symbolic program synthesis Synthesis of algorithm for range measurement equipment to track maneuvering aircraft using data on its dynamic and kinematic parameters Sentence-BERT: Sentence embeddings using Siamese BERTnetworks Artificial neural network based reinforcement learning for wind turbine yaw control SemBleu: A robust metric for AMR parsing evaluation Semantic Neural Machine Translation Using AMR Long range arena : A benchmark for efficient transformers Translate, then parse! a strong baseline for cross-lingual AMR parsing Evaluating scoped meaning representations