title: With Measured Words: Simple Sentence Selection for Black-Box Optimization of Sentence Compression Algorithms
authors: Shichel, Yotam; Kalech, Meir; Tsur, Oren
date: 2021-01-25

Sentence Compression is the task of generating a shorter, yet grammatical version of a given sentence, preserving the essence of the original sentence. This paper proposes a Black-Box Optimizer for Compression (B-BOC): given a black-box compression algorithm and assuming not all sentences need be compressed -- find the best candidates for compression in order to maximize both compression rate and quality. Given a required compression ratio, we consider two scenarios: (i) single-sentence compression, and (ii) sentence-sequence compression. In the first scenario, our optimizer is trained to predict how well each sentence could be compressed while meeting the specified ratio requirement. In the latter, the desired compression ratio is applied to a sequence of sentences (e.g., a paragraph) as a whole, rather than to each individual sentence. To achieve that, we use B-BOC to assign an optimal compression ratio to each sentence, then cast the selection as a Knapsack problem, which we solve using bounded dynamic programming. We evaluate B-BOC in both scenarios on three datasets, demonstrating that our optimizer improves both accuracy and Rouge-F1-score compared to the direct application of other compression algorithms.

Sentence Compression is the task of generating a short, accurate, and fluent sentence that preserves the essence of a given original sentence by removing nonessential words and/or rephrasing it in a compact form. Compression can take many forms, ranging from Extractive and Abstractive Summarization (Jing, 2000; Madnani et al., 2007; Cohn and Lapata, 2008, 2009; Galanis and Androutsopoulos, 2010; Rush et al., 2015; Chopra et al., 2016) to Text Simplification and Paraphrasing (Bannard and Callison-Burch, 2005; Xu et al., 2012; Klerke et al., 2016; Narayan et al., 2017; Aharoni and Goldberg, 2018; Botha et al., 2018), among others. On the sentential level, compression is often viewed as a word deletion task (Knight and Marcu, 2000, 2002; Filippova and Strube, 2008; Filippova et al., 2015; Wang et al., 2016, 2017; Zhou and Rush, 2019).

However, not all sentences could, or should, be compressed as part of compressing a longer text they reside in. Consider the familiar scenario in which a full paragraph needs to be compressed in order to have an EACL paper meet the page restriction specified in the submission guidelines. A common approach by LaTeX users is to first identify paragraphs ending with a short line (e.g., this very paragraph), then choose one or more sentences that could be compressed with a minimal loss of information, shaving off the extra line.

We propose a Black-Box Optimizer for Compression (B-BOC) that mitigates this problem. Given a compression algorithm A, a desired compression ratio, and a document D, B-BOC chooses the best sentences to compress using A in order to produce a shorter version of D, while keeping the other sentences of D untouched. B-BOC achieves that without explicit knowledge of the inner workings of the given compression algorithm, hence we call it a black-box optimizer. Selected sentences are expected to be the best candidates for compression, balancing compression rate with compression quality.
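To make the black-box contract concrete, the following minimal Python sketch illustrates it: the optimizer only needs to call the compression algorithm A on the sentences it selects and to score candidate sentences; it never inspects A's internals. The function and argument names (b_boc_compress, compress, score) are illustrative only and do not come from the paper's implementation; the actual scoring function R is defined in Section 3.

```python
from typing import Callable, List

def b_boc_compress(compress: Callable[[str], str],
                   score: Callable[[str], float],
                   document: List[str],
                   k: int) -> List[str]:
    """Compress only the k sentences predicted to compress best; leave the rest untouched."""
    # Rank sentences by their predicted compression quality (the oracle score).
    ranked = sorted(range(len(document)), key=lambda i: score(document[i]), reverse=True)
    chosen = set(ranked[:k])
    # Apply the black-box algorithm A only to the selected sentences.
    return [compress(s) if i in chosen else s for i, s in enumerate(document)]
```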
This paper addresses two main research questions: (1) How can we predict the compression performance (preservation of meaning and grammar) of an algorithm on a given sentence? (2) Given a document and a required compression ratio, how can we choose the optimal subset of sentences to compress, along with the appropriate compression ratio for each of them, so that the overall compression meets the required ratio?

Given a gold set of pairs of sentences and their compressions, we represent each sentence as a vector of shallow and syntactic features, and train a regression model to predict its expected compression rate. B-BOC ranks all sentences by their predicted compression potential while considering the required compression ratio. The document-level task can be modeled as a Knapsack optimization problem: selecting the subset of sentences to be compressed in order to satisfy the overall compression requirement (capacity) with a minimal loss of information (value). The solution space covers the trade-off between aggressively compressing only a few sentences and applying minimal compression to a larger number of sentences. While the general Knapsack problem is NP-complete, the 0-1 variation can be solved efficiently using Dynamic Programming (Hristakeva and Shrestha, 2005); see the sketch below.

We evaluate B-BOC on three benchmarks commonly used for the sentence compression task. We show that applying B-BOC on top of state-of-the-art sentence compression models improves performance for any desired compression rate. In addition, optimizing the B-BOC-Knapsack achieves top performance on the document-level task.

Early sentence compression works employ the noisy channel model, learning the words and clauses to be pruned (Knight and Marcu, 2000, 2002; Filippova and Strube, 2008; Clarke and Lapata, 2008; Cohn and Lapata, 2009). The top-performing sentence compression models use a Policy Network coupled with a Syntactic Language Model (bi-LSTM) evaluator (Zhou and Rush, 2019), and a stacked LSTM with dropout layers (Filippova et al., 2015). An extension of Filippova et al., adding syntactic features and using Integer Linear Programming (ILP), yields improved results in a cross-domain setting (Wang et al., 2017).

Sentence selection is used for extractive document summarization, a task conceptually close to ours, in which full sentences are extracted from a long document; see (Nenkova et al., 2011) for an overview. State-of-the-art selection is achieved by combining sentence and document encoders (CNN and LSTM) with a sentence extraction model (LSTM) and a reinforcement layer (Narayan et al., 2018).

Sentence rephrasing is an abstractive approach that rewrites a sentence into a shorter form, possibly using words that do not appear in the original sentence. A data-driven approach to abstractive sentence summarization is suggested by Rush et al. (2015) and Chopra et al. (2016), training on about four million title-article pairs from the Gigaword corpus and using a convolutional neural network to encode the source into a single representation of the entire input sentence. A tree-to-tree grammar extraction method is used for the rewriting task by Cohn and Lapata (2008, 2009). State-of-the-art performance on the abstractive summarization task is obtained using hierarchical attentional Seq2Seq recurrent neural networks (Nallapati et al., 2016; See et al., 2017).
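To illustrate the 0-1 Knapsack formulation mentioned above, the sketch below selects sentences with bounded dynamic programming. The mapping of sentences to item weights and values shown here (words removed by compressing a sentence, and its oracle score, respectively) is an assumption for illustration; the exact weights, values, and capacity used by B-BOC follow the formulation in Section 3.

```python
def knapsack_select(weights, values, capacity):
    """0-1 Knapsack via bounded dynamic programming.

    weights[i] -- integer cost of compressing sentence i (e.g., words removed);
    values[i]  -- predicted compression quality (oracle score) of sentence i;
    capacity   -- integer budget implied by the document-level ratio gamma.
    Returns the indices of the selected sentences.
    """
    n = len(weights)
    # dp[i][c]: best total value achievable with the first i sentences and budget c.
    dp = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = weights[i - 1], values[i - 1]
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]  # skip sentence i-1
            if w <= c and dp[i - 1][c - w] + v > dp[i][c]:
                dp[i][c] = dp[i - 1][c - w] + v  # compress sentence i-1
    # Backtrack to recover the chosen sentence indices.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    return sorted(chosen)

# Example: three candidates saving 3, 4, and 2 words with oracle scores
# 0.9, 0.6, and 0.8; with a budget of 5 words the first and third are chosen:
# knapsack_select([3, 4, 2], [0.9, 0.6, 0.8], 5) -> [0, 2]
```

The table-based dynamic program runs in O(n * capacity) time, i.e., in pseudo-polynomial time, which is what makes the 0-1 variation tractable in practice.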
In this section we formally define the sentence-level and the document-level tasks (§3.1) and provide a detailed description of the application of B-BOC in both settings (§3.2).

Sentence-Level Compression. Given a set of sentences $S = \{s_i\}_{i=1}^{n}$; a desired compression rate $\gamma$; the number of sentences to compress $k \le n$; a compression algorithm $A$; and an oracle $R : (A, S) \rightarrow [0, 1]$, returning a score reflecting the compression quality (grammaticality and minimal loss of information) $A$ would achieve on $s \in S$, we would like to choose a set $S_{k,\gamma} \subseteq S$ of $k$ sentences:

$S_{k,\gamma} = \arg\max_{S' \subseteq S,\ |S'| = k} \sum_{s \in S'} R(A, s) \quad \text{s.t.} \quad \frac{|A(s)|}{|s|} \le \gamma \ \ \forall s \in S'$

We call this sentence-level compression since each sentence should meet the $\gamma$ constraint independently. It is important to note that $\gamma' \le \gamma$ does not imply $S_{k,\gamma'} \subseteq S_{k,\gamma}$, since different sentences may be better compressed to different $\gamma$ values. Consider the following two sentences, used to illustrate the importance of the Oxford comma: S = {"I had a yummy dinner with my parents, Batman and Catwoman", "I had a yummy dinner with my parents, Batman, and Catwoman"}, and k = 1. The first sentence could be compressed to "I had a yummy dinner with my parents" with a minimal loss of information, while it does not make sense to compress the second sentence this way; it should rather be compressed to "I had a yummy dinner". Thus, for k = 1, the sentence to be compressed with minimal loss of meaning depends on the desired $\gamma$ value.

Document-Level Compression. In this setting we are given a sequence of sentences $D = \{s_i\}_{i=1}^{n}$ (a paragraph or a full document), and a desired compression rate $\gamma$ that should be applied to $D$ as a whole. That is, we wish to find an optimal subset $S_\gamma \subseteq D$ that satisfies:

$S_{\gamma} = \arg\max_{S' \subseteq D} \sum_{s \in S'} R(A, s) \quad \text{s.t.} \quad \frac{\sum_{s \in S'} |A(s)| + \sum_{s \in D \setminus S'} |s|}{\sum_{s \in D} |s|} \le \gamma$

Since $\gamma$ refers to $D$ rather than to individual sentences, the overall quality can be maximized by choosing a varying number of sentences expected to achieve different optimal compressions. Unlike the sentence-level setting, here an optimal $S_\gamma$ may contain a combination of sentences, for some of which $\frac{|A(s)|}{|s|} \le \gamma$, and for others $\frac{|A(s)|}{|s|} > \gamma$.

Scoring Function. Given a corpus $C = \{\langle s_i, \hat{s}_i \rangle\}_{i=1}^{m}$ of sentence pairs, where each pair contains an original sentence $s_i$ and its gold compression $\hat{s}_i$, we define the golden ratio $\hat{\gamma}_i = \frac{|\hat{s}_i|}{|s_i|}$ and posit $R(A, s_i) \approx 1 - \left|\hat{\gamma}_i - \frac{|A(s_i)|}{|s_i|}\right|$. We justify the use of $\hat{\gamma}_i$ as a proxy for the optimal compression quality, as compression ratios are found to correlate with compression quality measured against gold compressions (Napoles et al., 2011). The use of $\hat{\gamma}_i$ as a proxy is further validated through manual evaluation; see Sections 4.2 and 5.1.

Syntactic features were successfully used for sentence compression (Clarke and Lapata, 2008; Wang et al., 2017; Futrell and Levy, 2017). Assuming that sentence complexity correlates with the ease of compression, we follow Brunato et al. (2018) and represent each sentence as a vector of shallow features (sentence length, average word length, punctuation counts, etc.) and syntactic features (depth of the constituency parse tree and its number of internal nodes, each word's depth in the dependency parse tree, mean dependency distance, etc.). We then train a regression model and learn the scoring function $R(A, s)$ by minimizing the loss:

$\mathcal{L} = \sum_{i=1}^{m} \left( R(A, s_i) - \left( 1 - \left| \hat{\gamma}_i - \frac{|A(s_i)|}{|s_i|} \right| \right) \right)^{2}$

We note that we do not train a compression algorithm, but an oracle: a scoring function that predicts the quality the compression algorithm $A$ will achieve on a given sentence. This oracle is used to rank candidate sentences in order to optimize the choice of sentences in the two tasks defined in Section 3.1. We train a Gradient Boosted Tree regression model using XGBoost.
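A minimal sketch of how such an oracle could be trained is given below, assuming spaCy for parsing and the XGBRegressor interface of the xgboost package. The feature set shown (sentence length, average word length, punctuation count, dependency-tree depth, mean dependency distance) only approximates the shallow and syntactic features described above, and constituency-tree features are omitted; the training targets are the scores 1 - |γ̂_i - |A(s_i)|/|s_i|| computed from the gold compressions.

```python
import numpy as np
import spacy
from xgboost import XGBRegressor

# Assumed parser; the paper does not commit to a specific toolkit.
nlp = spacy.load("en_core_web_sm")

def features(sentence: str) -> list:
    """Shallow and syntactic features in the spirit of Brunato et al. (2018)."""
    doc = nlp(sentence)
    tokens = [t for t in doc if not t.is_space]
    n = len(tokens)
    avg_word_len = float(np.mean([len(t.text) for t in tokens])) if n else 0.0
    punct = sum(t.is_punct for t in tokens)

    def depth(tok):
        # Number of hops from the token to the root of the dependency tree.
        d = 0
        while tok.head is not tok:
            tok, d = tok.head, d + 1
        return d

    depths = [depth(t) for t in tokens] or [0]
    dep_dist = [abs(t.i - t.head.i) for t in tokens] or [0]
    return [n, avg_word_len, punct, max(depths), float(np.mean(dep_dist))]

def train_oracle(sentences, targets):
    """Fit the scoring function R; `targets` are the per-sentence scores defined above."""
    X = np.array([features(s) for s in sentences])
    # Illustrative hyperparameters; the paper tunes these on a development set.
    model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1, subsample=0.8)
    model.fit(X, np.array(targets))
    return model
```

At inference time, the fitted model scores unseen sentences from their feature vectors, and these scores are used to rank compression candidates.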
The model's hyperparameters (e.g., subsample ratio, learning rate, max depth) were tuned on a separate development set.

Sentence-level compression: Given a set of sentences $S$, B-BOC operates in two steps: (i) it applies $R$ to every $s \in S$, producing an ordered set $S$ for which $\forall i$