key: cord-0221846-1zws2o1n
authors: Bista, Umanga; Mathews, Alexander Patrick; Menon, Aditya Krishna; Xie, Lexing
title: SupMMD: A Sentence Importance Model for Extractive Summarization using Maximum Mean Discrepancy
date: 2020-10-06
journal: nan
DOI: nan
sha: 2f67e5965875fdedfe42bb1e3e088b08214e089e
doc_id: 221846
cord_uid: 1zws2o1n

Most work on multi-document summarization has focused on generic summarization of information present in each individual document set. However, the under-explored setting of update summarization, where the goal is to identify the new information present in each set, is of equal practical interest (e.g., presenting readers with updates on an evolving news topic). In this work, we present SupMMD, a novel technique for generic and update summarization based on the maximum mean discrepancy from kernel two-sample testing. SupMMD combines both supervised learning for salience and unsupervised learning for coverage and diversity. Further, we adapt multiple kernel learning to make use of similarity across multiple information sources (e.g., text features and knowledge based concepts). We show the efficacy of SupMMD in both generic and update summarization tasks by meeting or exceeding the current state-of-the-art on the DUC-2004 and TAC-2009 datasets.

Multi-document summarization is the problem of producing condensed digests of salient information from multiple sources, such as articles. Concretely, suppose we are given two sets of articles (denoted set A and set B) on a related topic (e.g., climate change, the COVID-19 pandemic), separated by publication timestamp or geographic region. We may then identify three possible instantiations of multi-document summarization (see Figure 1 ):

(i) generic summarization, where the goal is to summarize a set (A or B) individually. (ii) comparative summarization, where the goal is to summarize a set (B) against another set (A) while highlighting the differences. (iii) update summarization, where the goal is both generic summarization of set A and comparative summarization of set B versus A. Most existing work on this topic has focused on the generic summarization task. However, update summarization is of equal practical interest. Intuitively, the comparative aspect of this setting aims to inform a user of new information on a topic they are already familiar with.

Multi-document extractive summarization methods can be unsupervised or supervised. Unsupervised methods typically define salience (or coverage) using a global model of sentence-sentence similarity. Methods based on retrieval (Goldstein et al., 1999) , centroids , graph centrality (Erkan and Radev, 2004) , or utility maximization (Lin and Bilmes, 2010, 2011; ) have been well explored. However, sentence salience also depends on surface features (e.g., position, length, presence of cue words); effectively capturing these requires supervised models specific to the dataset and task. A body of work has incorporated such information through supervised learning, for example based on point processes (Kulesza and Taskar, 2012) , learning important words , graph neural networks (Yasunaga et al., 2017) , and support vector regression (Varma et al., 2009) . These supervised methods have either a separate model for learning and inference, leading to a disconnect between learning sentence salience and sentence selection (Varma et al., 2009; Yasunaga et al., 2017; , or are designed specifically for generic summarization (Kulesza and Taskar, 2012) . In this work, we propose SupMMD, which has a single model of learning salience and inference and can be applied to generic and comparative summarization. We make the following contributions:

(1) We present SupMMD, a novel technique for both generic and update summarization that combines supervised learning for salience and unsupervised learning for coverage and diversity. SupMMD has a single model for learning and inference.

(2) We adapt multiple kernel learning (Cortes et al., 2010) into our model, which allows similarity across multiple information sources (e.g., text features and knowledge based concepts) to be used.

(3) We show that SupMMD meets or exceeds the state-of-the-art in generic and update summarization on the DUC-2004 and TAC-2009 datasets.

Multi-document summarization can be extractive, where salient pieces of the original text such as sentences are selected to form the summary; or abstractive, where a new text is generated by paraphrasing important information. The former is popular as it often creates semantically and grammatically correct summaries (Nallapati et al., 2017) . In this work, we focus on generic and update multidocument summarization in the extractive setting.

Most extractive summarizers have two components: sentence scoring and selection. A variety of unsupervised and supervised methods have been developed for the former. Unsupervised sentence scorers are based on centroids , graph centrality (Erkan and Radev, 2004) , retrieval relevance (Goldstein et al., 1999) , word statistics (Nenkova and Vanderwende, 2005) , topic models (Haghighi and Vanderwende, 2009 ), or concept coverage Lin and Bilmes, 2011) . Supervised techniques include: using a graph-based neural network (Yasunaga et al., 2017) , learning sentence quality from point processes (Kulesza and Taskar, 2012) , combining word importances , combining sentence and phrase importances (Cao et al., 2015) , or employing a mixture of submodular functions (Lin and Bilmes, 2012) .

Sentence selection methods can be broadly categorized as greedy methods (Goldstein et al., 1999; Erkan and Radev, 2004; Nenkova and Vanderwende, 2005; Cao et al., 2015; Haghighi and Vanderwende, 2009; Kulesza and Taskar, 2012; Cao et al., 2015; Varma et al., 2009) , which produce approximate solutions by iteratively selecting the sentences with the maximal score, or exact integer linear programming (ILP) based methods Cao et al., 2015) . Some greedy methods use an objective which belongs to a special class of set functions called submodular functions (Lin and Bilmes, 2010 Kulesza and Taskar, 2012) , which have good approximation guarantees under greedy optimization (Nemhauser et al., 1978) .

There has been limited research into update and comparative summarization. Notable prior work includes maximizing concept coverage using ILP , learning sentence scores using a support vector regressor (Varma et al., 2009) , and temporal content filtering (Zhang et al., 2009) . Bista et al. (2019) cast the comparative summarization problem as classification, and use MMD (Gretton et al., 2012) . In this work, we adapt their method to learn sentence importances driven by surface features.

We review a perspective introduced by Bista et al. (2019) , where summarization is viewed as classification, and provide a brief introduction to Maximum Mean Discrepancy (MMD). Both these ideas form the basis of our subsequent method.

Let {V t } T t=1 be T topics of articles that we wish to summarize. For a topic t, we wish to select summary sentences S t . Bista et al. (2019) formulated summarization as selecting prototypes that minimize the accuracy of a powerful classifier between sentences in the input and summary. The intuition is that a powerful classifier should not be able to distinguish between the sentences from articles and summary sentences. Formally, we pick

where S t ⊂ 2 V t : ∀ S ∈S s∈S len(s) ≤ L comprise subsets of V t with upto L words, and Acc(X,Y ) is the accuracy of the best possible classifier that distinguishes between elements in sets X and Y ; we shall shortly realize this using MMD.

For comparative summarization between two sets A and B, Bista et al. (2019) introduced an additional term into (3.1), giving rise to competing goals for the classifier: it should not be able to distinguish between the summaries and sentences from set B, but should be able to distinguish between the summaries and sentences from set A. Formally, let V t B be the set of sentences in set B, V t A be the sentences in set to compare (set A). Then, for suitable λ > 0, we seek S t , the summary sentences of set B,

The hyperparameter λ controls the relative importance of accurately representing articles in set B, versus not representing the articles in set A.

The MMD is a kernel-based measure of the distance between two distributions. More formally:

Definition 3.1. Let H be a Reproducing Kernel Hilbert Space (RKHS) with associated kernel k. Let F be the set of functions h : X → R in the unit ball of H, where X is a topological space. Then, the MMD between distributions p,q is the maximal difference in expectations of functions from F under p,q (Gretton et al., 2012) :

A small MMD value indicates that p, q are similar. Given finite samples X ∼ p n and Y ∼ q m , an empirical estimate of the MMD, denoted as MMD 2 F (X,Y ), can be computed as:

The MMD corresponds to the minimal achievable loss of a centroi-based kernel classifier (Sriperumbudur et al., 2009 ). Consequently, we use MMD 2 F (V,S) to approximate the Acc(V,S) in (3.1) and (3.2), using a suitable kernel k that measures the similarity of two sentences. Intuitively, this selects summaries S which best represent the distribution of original sentences V .

Note that if we expand MMD 2 F (V,S) as per (3.4) and later in §4.6, the first term is irrelevant for optimization. The second and third term capture the coverage and diversity of the summary sentences without any supervision. Hence, this is an unsupervised summarization.

We start by developing a technique for incorporating sentence importance into MMD for the purpose of generic multi-document extractive summarization. We then extend this method to comparative summarization, and incorporate multiple different kernels to use a diverse sets of features.

Unsupervised MMD (Bista et al., 2019) selects representative sentences that cover relevant concepts while retaining diversity. The notion of representativeness is based on a global model of sentence-sentence similarity; however, this notion of representativeness is not necessarily well matched to the selection of salient information. Salience of a sentence may be determined by surface features such as position in the article, or number of words. For example, news articles are often written such that sentences at the start of a article have the characteristics of a summary (Kedzie et al., 2018) . Learning a notion of salience that is specific to the summarization task and dataset requires supervised training. Thus, we extend the MMD model by incorporating supervised sentence importance weighting.

Let v, s ∈ X be independent samples drawn from the distributions of article sentences p and summary sentences q on the space of all sentences X. We define non-negative importance functions f p θ , f q θ parameterized by learnable parameters θ. We restrict these functions so that E p f p θ (v) = 1 and E q f q θ (s) = 1. Equipped with f θ , we may modify MMD such that the importance of sentences which are good summary candidates is increased.

Note that classic MMD (3.3) is a special case of (4.1) where f θ ≡ 1.

In practice, the supremum over all h is impossible to compute directly. We thus derive an alternative form for Equation 4.1. Lemma 4.1. For h H ≤ 1, (4.1) is equivalently

In the above, φ : X → F is a canonical feature mapping of sentences and summaries from X to RKHS. The derivation, which mirrors a similar derivation for MMD (Gretton et al., 2012) , is given in the Appendix.

We use log-linear models as importance functions, as they are a common choice of sentence importance (Kulesza and Taskar, 2012) and easy to fit when training data is scarce. Formally, the log-linear importance function is:

where nt = |V t | is the number of sentences and mt = |S t | is the number of summary sentences in topic t.

The parameters θ of the log-linear importance function must be learned from data, so we define a loss function based on weighted MMD. Let {(V t ,S t )} T t=1 be the T training tuples. Then, the loss of topic t is the square of importance weighted empirical MMD between sentences and summary sentences from within the topic:

is an empirical estimate of the weighted MMD 2 F (p,q,θ). Applying the kernel trick to Equation 4.4 gives (see Appendix):

.5 is the loss for a single topic but during training we will instead minimize the average loss over all topics in the training set, i.e., min θ 1 T T t=1 L t (V t , S t , θ). Intuitively, we learn the parameters θ by minimizing an importance weighted distance between sentences and ground truth summary sentences over all topics.

We now extend the learning task to comparative summarization using the competing binary classifiers idea of Bista et al. (2019) (cf. §3.2). Specifically, we replace the accuracy terms in Equation 3.2 with the square of weighted MMD. Given the T comparative training tuples

, then the objective is to minimize:

Note there are two sets of importance parameters θ B ,θ A one for each of the two document sets.

We employ Multiple Kernel Learning (MKL) to make use of data from multiple sources in our MMD summarization framework. We adapt two stage kernel learning (Cortes et al., 2010) , where different kernels are linearly combined to maximize the alignment with the target kernel of the classification problem. Since MMD can be interpreted as classifiability (Sriperumbudur et al., 2009 ) MKL fits neatly into our MMD based summarization objective. Intuitively, MKL should identify a good combination of kernels for building a classifier that separates summary and non-summary sentences. Let {k i } p i=1 be p kernel functions. For topic t, let K t i be the kernel matrix according to kernel function k i , and K t i = U nt K t i U nt be the centered kernel matrix, with U nt = I − 11 T /n t . Let y t = {±1} n t be the ground truth summary labels with y t i = +1 iff i ∈ S t . The target kernel y t (y t ) T represents the ideal notion of similarity between sentences. The non-negative kernel weights w which lead to the optimal alignment with the target kernel are given by (Cortes et al., 2010) 

where M t ∈ R p×p has M t rs = K r ,K s F and a t ∈ R p has a i = K t i ,y t (y t ) T F . The kernel function must be characteristic for MMD to be a valid metric (Muandet et al., 2017) . Most popular kernels used for bag of words like text features (including TF-IDF), the linear kernel (k(x, y) = x, y ) and the cosine kernel (k(x,y) = x,y

x y ), are not characteristic (Sriperumbudur et al., 2010) . Fortunately, the exponential kernel, k(x, y) = exp(γk (x,y)), γ > 0, is characteristic for any kernel k (Steinwart, 2001) .

Hence, we use the normalized exponential kernel combined with the cosine kernel, k(x, y) = exp(−γ)exp γ p i=1 w i ·cos x (i) ,y (i) .

Given a learned importance function f θ , we may find the best set of summary sentencesS t for generic summarization via:

Similarly, for the comparative task, with learned importance functions, we seekS t as:

Both these inference problems are budgeted maximization problems, which are often solved by greedy algorithms (Lin and Bilmes, 2010) . The generic unsupervised summarization task is submodular and monotone under certain conditions (Kim et al., 2016) , so greedy algorithms have good theoretical guarantees (Nemhauser et al., 1978) . While our supervised variants do not have these guarantees, we find that greedy optimization nonetheless leads to good solutions.

We include guidance on applying SupMMD and the details required to reproduce our experiments.

We use four standard multi-document summarization benchmark datasets: DUC-2003 , DUC-2004 , TAC-2008 and TAC-2009 

The DUC and TAC datasets are provided as collections of XML documents, so it is necessary 1 https://duc.nist.gov/data.html to extract relevant text and then perform sentence and word tokenization. For DUC we clean the text using various regular expressions the details of which are provided in our code release. We train PunktSentenceTokenizer to detect sentence boundaries, and use the standard NLTK (Bird, 2006) word tokenizer. For the TAC dataset, we use the preprocessing pipeline employed by 2 . This enables a cleaner comparison with the state-of-the-art ICSI ) method on the TAC dataset. For all datasets, we keep the sentences between 8 and 55 words per Yasunaga et al. (2017) .

Our method requires two different sets of sentence features: text features, which are used to compute the sentence-sentence similarity as part of the kernel; and surface features, which are used in learning the sentence importance model.

Each sentence has three different feature representations: unigrams, bigrams and entities. The unigrams are stemmed words, with stop words from the NLTK english list removed. The bigrams are a combination of stemmed unigrams and bigrams. The entities are DBPedia concepts extracted using DBPedia Spotlight (Mendes et al., 2011) . We use a Term Frequency Inverse Sentence Frequency (TF-ISF) (Neto et al., 2000) representation for all text features. TF-ISF has been used extensively in multi-document summarization (Dias et al., 2007; Alguliev et al., 2011; Wan et al., 2007) .

We use 10 surface features for the DUC dataset, and 12 for the TAC dataset: position: There are five position features. Four indicators denote the 1 st , 2 nd , 3 rd or a later position of the sentence in the article. The final feature gives the position relative to the length of the article. counts: There are two count features: the number of words and number of nouns. We use the spaCy 3 part of speech tagging to find nouns. tfisf: This is the sum of the TS-ISF scores for unigrams composing the sentence. For sentence s, this is w∈s isf(w)·tf(w,s), where isf(w) is the inverse sentence frequency of unigram w, and tf(w,s) is the term frequency of w in s. Table 1 : Dataset statistics and oracle performance. We report the number of topics in each dataset, along with the number of sentences after preprocessing. We show the ROUGE scores of our oracle method and the one by Liu and Lapata (2019) with average number of sentence in summary from each method.

btfisf: The boosted sum of TS-ISF scores for unigrams composing the sentence. Specifically, we compute w∈s isf(w)·b(w)·tf(w,s), where we boost the score of unigrams w that appear in the first sentence of the article as b(w). In the generic summarization b(w) = 2, for comparative summarization b(w) = 3, as used by . Unigrams that do not appear in the first sentence of the article have b(w) = 1. lexrank The LexRank score (Erkan and Radev, 2004 ) computed on the bigrams' cosine similarity.

For the TAC datasets, we additionally use: par_start: An indicator whether the sentence begins a paragraph. This is provided by the preprocessing pipeline from ICSI . qsim: The fraction of topic description unigrams present in each sentence; these topic descriptions are only available for TAC.

Both DUC and TAC provide four human written summaries for each topic. Since our goal is extractive summarization with supervised training, we need to know which sentences in the articles could be used to construct the summaries in the training set. The article sentences that best match the abstractive summaries are called the oracles (S t ).

Algorithm 1 Oracle extraction

len(s) r

S t ← S t ∪{s * } return S t Our extraction algorithm (Algorithm 1), is inspired by Liu and Lapata (2019) . We greedily select sentences (s) which provide the maximum gain in extraction score α(S t ,H t ) against the human summaries (H t ) until a word budget (L) is reached. We only include sentences between 8 to 55 words as suggested by Yasunaga et al. (2017) , and set a budget of 104 words to ensure our oracle summaries are within 100±4 words, consistent with the evaluation ( §5.6).

In contrast to Liu and Lapata (2019) which uses only ROUGE-2 recall score (Lin, 2004) , our method balances both ROUGE-1 and ROUGE-2 recall scores using the harmonic mean and explicitly accounts for sentence length. Grid search on the validation sets shows that the optimal value for r is 0.4 across different datasets and summarization tasks. As reported in Table 1 , on average our method produces oracles consisting of more sentences and with higher ROUGE-1 and ROUGE-2 scores compared to oracles from Liu and Lapata (2019) . This is consistent across all datasets.

Supervised variants use an 2 regularized log-linear model of importance ( §4.2) trained using the oracles ( §5.4) as ground truth. We selected the number of training epochs using 5-fold cross validation. We then tune the other hyperparameters on the training set. The hyperparameters of the generic summarization task are: γ, a parameter of the kernel; β, the 2 regularization weight for the log-linear importance function; and r, which defines the length dependent scaling factor in greedy selection (Lin and Bilmes, 2010) . The comparative objective (4.6) has an additional hyperparameter λ, which controls the comparativeness. More implementation details are provided in the Appendix. We will make implementation publicly available 4 .

To evaluate our methods we use the ROUGE (Lin, 2004 ) metric, the de facto choice for evaluating both generic summarization Cho et al., 2019; Yasunaga et al., 2017; Kulesza and Taskar, 2012) , and update summarization (Varma et al., 2009; Zhang et al., 2009; Li et al., 2009 ). ROUGE metrics have been shown to correlate with human judgments (Lin, 2004) in generic summarization task. Our recent work (Bista et al., 2019) shows that human judgments are consistent with the automatic metrics for evaluating comparative summaries.

Both DUC and TAC evaluations use the first 100 words of the generated summary. Our DUC-2004 evaluation setup mirrors . This allows us to compare performance with the state-of-the-art methods they reported and other works also evaluated using this setup 5 . As is standard for the DUC-2004 datasets, we report ROUGE-1 and ROUGE-2 recall scores.

For TAC-2009 datasets (both set A and B) , we adopt the evaluation settings from the TAC-2009 competition 6 so we can compare against the three best performing systems in the competition 7 . As is standard for the TAC-2009 dataset, we report ROUGE-2 and ROUGE-SU4 recall scores.

We select the top performing methods from a recent benchmark paper to serve as baselines and report ROUGE scores from the benchmark paper. They are: ICSI: an integer linear programming method that maximizes coverage , DPP: a determinantal point process method that learns sentence quality and maximizes diversity (Kulesza and Taskar, 2012) , Submodular: a method based on a learned mixture of submodular functions (Lin and Bilmes, 2012) , OCCAMS_V: a method base on topic modeling (Conroy et al., 2013) , Regsum: a method that focuses on learning word importance , Lexrank: a popular graph based sentence scoring method (Erkan and Radev, 2004) .

We also include recent deep learning methods evaluated using the same setup as and report ROUGE scores from the individual papers: DPPSim: an extension to the DPP model which learns the sentence-sentence similarity using a 5 ROUGE 1.5.5 with args -n 4 -m -a -l 100 -x -c 95 -r 1000 -f A -p 0.5 -t 0 6 tac.nist.gov/2009/Summarization 7 args -n 4 -w 1.2 -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0 -a -l 100 capsule network (Cho et al., 2019) , HiMAP: a recurrent neural model that employs a modified pointer-generator component (Fabbri et al., 2019) , and GRU+GCN: a model that uses a graph convolution network combined with a recurrent neural network to learn sentence saliency (Yasunaga et al., 2017) .

TAC-2009: As baselines for the TAC-2009 dataset we use the top three systems in the TAC-2009 competition for each task, resulting in four systems altogether. To the best of our knowledge these systems are the current state-of-the-art. We report the ROUGE scores from the competition. The systems are: ICSI: with two variants: Sys.34 uses integer linear programming to maximize coverage of concepts , and Sys.40, which additionally uses sentence compression to generate new candidate sentences, IIT: uses a support vector regressor to predict sentence ROUGE scores (Varma et al., 2009) , ICTCAS: a temporal content filtering method (Zhang et al., 2009) , and ICL: a manifold ranking based method (Li et al., 2009 ).

We compare our methods with the baselines on the DUC-2004 , TAC-2009 -A and TAC-2009 datasets. We present several variants of our method to analyze the effects of different components and modeling choices. We report the performance of unsupervised MMD (UnsupMMD) which does not explicitly consider sentence importance. For our supervised method SupMMD, we report the performance with a bigram kernel (SupMMD) and combined kernels (SupMMD + MKL). We also evaluated the impact of our oracle extraction method by replacing it with the extraction method suggested by Liu and Lapata (2019) in SupMMD + alt oracles. Meanwhile, SupMMD + MKL + compress presents the result of applying sentence compression to our model.

The performance of our methods on the DUC-2004 generic summarization task are shown in Table 2 . On the DUC-2004 dataset all SupMMD variants exceed the state-of-the-art, when evaluated with ROUGE-2, and perform similarly to the best existing methods when evaluated with ROUGE-1.

ICSI 38.41 9.78 DPP (Kulesza and Taskar, 2012) 39.79 9.62 Submodular (Lin and Bilmes, 2012) 39.18 9.35 OCCAMS_V (Conroy et al., 2013) 38.50 9.76 Regsum 38.57 9.75 Lexrank Erkan and Radev (2004) 35.95 7.47 DPP-Sim (Cho et al., 2019) 39.35 10.14 HiMAP (Fabbri et al., 2019) 35.78 8.90 GRU+GCN (Yasunaga et al., 2017) Our best system SupMMD + MKL outperforms the previous best system (ICSI) on ROUGE-2 score by +3.9%. While the DPP baseline achieves the highest ROUGE-1 score on DUC-2004, it has a relatively low ROUGE-2 score which suggests it is optimized for unigram performance at the cost of bigram performance. SupMMD + MKL strikes a better balance, scoring the best in ROUGE-2 and second best in ROUGE-1. On the TAC-2009 generic summarization task in Table 3 our SupMMD + MKL model outperforms the state-of-the-art ICSI model on both ROUGE-2 and ROUGE-SU4. Specifically, SupMMD + MKL scores 12.33 in ROUGE-2 while the best ICSI variant scores 12.16 in ROUGE-2. Supervised Modeling: Models using supervised training to identify important sentences substantially outperform the unsupervised method Un-supMMD. In fact, UnsupMMD is the lowest scoring method across all metrics and datasets. This strongly indicates that a degree of supervision is essential to perform well in this task, and that the importance function is a suitable way to adapt the UnsupMMD model to supervised training. Moreover, we observe a strong correlation between the the relative position of a sentence and the score given by SupMMD. This observation is consistent with previous works (Kedzie et al., 2018) , and demonstrates that SupMMD has learned to use the surface features to capture salience. Further details of feature correlations are provided in the Appendix.

Oracle extraction: Our oracle extraction technique for transforming abstractive training data to extractive training data helps SupMMD methods achieve higher ROUGE performance. An alternative technique developed by Liu and Lapata (2019) and implemented in SupMMD (alt oracle) gives lower performance than our technique. For example, on DUC-2004 SupMMD (alt oracle) has a ROUGE-1 of 39.02 and ROUGE-2 of 10.22, while SupMMD has a ROUGE-1 of 39.36 and a ROUGE-2 of 10.31. Thus, the advantages of our proposed oracle extraction method are substantial and consistent across multiple datasets and evaluation metrics. Multiple Kernel Learning: We observe that combining multiple kernels helps the performance of SupMMD models on the generic summarization task. SupMMD + MKL which combines both bigram and entity kernels has a ROUGE-2 of 10.54 on DUC-2004, while SupMMD only uses the bigrams kernel and scores 10.31 in ROUGE-2. Multiple kernels show even clearer gains in the TAC-2009-A dataset.

Sentence compression incorporated into the post-processing steps of SupMMD + MKL + compress does not clearly improve the results over SupMMD + MKL. On TAC-2009-A, compression clearly reduces performance, and on DUC-2004 SupMMD + MKL + compress has a higher ROUGE-1 score but a lower ROUGE-2 score than SupMMD + MKL. Incorporating compression into the summarization pipeline is an appealing direction for future work.

The results for the comparative summarization task on the TAC-2009-B dataset are shown in Table 4 . Our supervised MMD variants SupMMD and Sup-MMD + MKL both outperform the state-of-the-art baseline ICSI in ROUGE-SU4 but fall short in ROUGE-2. It would be hard to claim that either method is superior in this instance; however, it does show that SupMMD -which uses a substantially different approach to that of ICSI -provides an alternative state-of-the-art. Thus SupMMD further maps out the set of techniques that are useful for comparative summarization. As per the generic summarization task, both our supervised training method and oracle extraction method are essential for achieving good performance in ROUGE-2 and ROUGE-SU4. We also identify sentence position and btfisf as important features for sentence salience (see the Appendix).

Multiple kernels as in SupMMD + MKL has relatively little effect, reducing the ROUGE-2 score to 10.24 from the slightly higher 10.28 achieved by SupMMD. A similar small decrease is seen for ROUGE-SU4. Manual inspection shows that the summaries from SupMMD and SupMMD + MKL methods are largely identical with differences primarily on topic D0908, which covers political movements in Nepal. The key entities in this topic are not resolved accurately by DBPedia Spotlight, contributing additional noise and affecting the MKL approach.

Model variants: We have tested an additional variant of our model for comparative summarization, SupMMD 2 , which defines two different importance functions: one for each of the two document sets -A and B (See §4.4 for details). In contrast, SupMMD has a single importance function shared between document sets, i.e., in Equation (4.6), θ A = θ B . SupMMD 2 performed substantially worse than SupMMD in both metrics, for example, SupMMD has a ROUGE-2 of 10.28 while SupMMD 2 has a ROUGE-2 of 9.94. We conjecture that a single importance function performs better when training data is relatively scarce because it reduces the number of parameters and simplifies the learning problem. Techniques for tying together the parameters for both importance functions, such as with a hierarchical Bayesian model, are left as future work.

In this work, we present SupMMD, a novel technique for update summarization based on the maximum mean discrepancy. SupMMD combines supervised learning for salience, and unsupervised learning for coverage and diversity. Further, we adapt multiple kernel learning to exploit multiple sources of similarity (e.g., text features and knowledge based concepts). We show the efficacy of SupMMD in both generic and update summarization tasks on two standard datasets, when compared to the existing approaches. We also show that the importance model we introduce on top of our existing unsupervised MMD (Bista et al., 2019) improves the summarization performance substantially on both generic and comparative summarization tasks.

For future work, we leave the task of incorporating embeddings features such as BERT (Devlin et al., 2019) , and evaluating with large generic multi-document summarization dataset Multi-News (Fabbri et al., 2019) . Jin Zhang, Pan Du, Hongbo Xu, and Xueqi Cheng. 2009 . Ictgrasper at TAC2009: temporal preferred update summarization. In Proceedings of the Second Text Analysis Conference, TAC 2009 , Gaithersburg, Maryland, USA, November 16-17, 2009 .

In this section, we provide a brief overview of kernels and Maximum mean Discrepancy (MMD). For a detailed overview, we refer readers to to Muandet et al. (2017) and Gretton et al. (2012) from which this brief overview is taken.

Definition A.1. A function k : X × X → R is called positive definite kernel if it is symmetric, i.e. ∀ x,y∈X k(x, y) = k(y, x) and gram matrix is positive definite, i.e. ∀ n∈N ∀ c 1 ,c 2 ,..cn∈R

This is known as the kernel trick in machine learning. The feature space H is called a Reproducing Kernel Hilbert Space (RKHS), and the kernel k is also known as reproducing kernel. In an RKHS, function evaluation h(x) = h, φ(x) H , where φ : X → H are canonical feature map associated with RKHS H, and φ(x) = k(.,x). A RKHS is fully characterized by its reproducing kernel k, or a positive definite kernel k uniquely determines a RKHS and vice versa .

which is known as the Riesz representer theorem (Conway, 1990) .

Recall that F is a class of RKHS functions within the unit ball, i.e. h ∈ H, h H ≤ 1. Suppose H admits a feature map φ : X → H. Then, per Gretton et al. (2012) , we may solve the supremum in Equation 3.3 as

Hence, MMD is computed as the distance between the mean feature embeddings under each distribution, for a suitable kernel-based feature space (Gretton et al., 2012) .

Eq. (A.1) involves explicitly evaluating the arbitrarily high-dimensional features. Instead, the kernel trick allows efficient computation of MMD 2 F (p, q) by evaluating just pairwise kernels. Supposing H has induced kernel k, we have

For a distribution p, and kernel with feature map φ : X → H, the kernel mean map is

A kernel k is characteristic if the map µ : p → µ p is injective. A characteristic kernel ensures MMD is 0 if and only if p = q, i.e., no information is lost in mapping the distribution into the RKHS (Muandet et al., 2017) . Examples of characteristic kernels for R d include the Gaussian kernel ( k(x, y) = exp −γ x−y 2 2 , γ > 0 ), and Laplace kernel ( k(x,y) = exp(−γ x−y 1 ), γ > 0 ) . MMD with the Gaussian kernel is equivalent to comparing all moments between two distributions .

We train generic summarization model with full batch LBFGS (Liu and Nocedal, 1989) with learning rate 0.005. We train comparative summarization model using Yogi optimizer (Zaheer et al., 8 https://en.wikipedia.org/wiki/Dual_norm 2018), with a mini batch size of 8 topics, learning rate 0.002, and decreasing the learning rate by half every 20 epochs. We choose the number of training epochs by validating across 5 folds with early stopping. We set the patience to 20 epochs for early stopping with LBFGS optimizer and 50 epochs with Yogi optimizer. We tune the other hyperparameters on the training set, and the optimal hyperparameters of best model (SupMMD + MKL) and searched space are shown in Table 5 . The kernel combination weights w ( §4.5) are also shown in 

In this section we provide some additional results. We analyze the correlation between normalized ROUGE recall scores of the sentences and sentence scores from SupMMD and Lexrank. The normalized rouge score of each sentence is defined as ROUGE(s) = ROUGE(s) #words(s) . As shown in Table 8 , we find that SupMMD has a slightly high correlation with sentence rouge scores. This suggests that SupMMD is better in capturing sentence importance for summarization.

We analyze the correlation between various surface features and sentence importance scores from SupMMD and Lexrank (Erkan and Radev, 2004 ICSI A fourth day of thrashing thunderstorms began to take a heavier toll on southern California with at least three deaths blamed on the rain, as flooding and mudslides forced road closures and emergency crews carried out harrowing rescue operations. Downtown Los Angeles has had more than 15 inches of rain since Jan. 1, more than its average rainfall for an entire year, including 2.6 inches, a record. Meteorologists say Southern California has not been hit by this much rain in nearly 40 years. The disaster was the latest caused by rain and snow that has battered California since Dec. 25.

Californians braced for even more rain as they struggled to recover from storms that have left at least nine people dead, triggered mudslides and tornadoes, and washed away roads and runways. The record, 38.18 inches (96.98 centimeters), was set in 1883-1884.

Mudslides forced Amtrak to suspend train service between Los Angeles and Santa Barbara through at least Thursday. A winter storm pummeled Southern California for the third straight day, claiming the lives of three people and raising fears of mudslides, even as homes around the region were evacuated. Staff Writers Rick Orlov and Lisa Mascaro contributed to this story.

SupMMD Downtown Los Angeles has had more than 15 inches of rain since Jan. 1, more than its average rainfall for an entire year, including 2.6 inches, a record. A fourth day of thrashing thunderstorms began to take a heavier toll on southern California with at least three deaths blamed on the rain, as flooding and mudslides forced road closures and emergency crews carried out harrowing rescue operations. The roads in Los Angeles County were equally frustrating. Part of a rain-saturated hillside gave way, sending a Mississippi-like torrent of earth and trees onto four blocks of this oceanfront town and killing two men.

Storms have caused $52.5 million (euro39.8 million) in damage to Los Angeles County roads and facilities since the beginning of the year. Multi-million-dollar homes collapsed and mudslides trapped residents in their homes as a heavy rains that have claimed three lives pelted Los Angeles for the fifth straight day. In scenes reminiscent of the aftermath of the Northridge Earthquake 11 years ago this month, Los Angeles area residents faced gridlocked freeways and roads Wednesday while cleanup crews cleared mud, rubble and debris left from a two-week siege of rain. A recordshattering storm slammed Southern California for a sixth straight day Tuesday, triggering mudslides and tornadoes and forcing more road closures, but forecasters predicted it would wane Wednesday before a new storm moves in Sunday night. shown in table 6, SupMMD has higher correlation with relative position, signifying the importance of position of sentence in summary sentences. Lexrank has a higher correlations with the number of words, number of nouns and TFISF scores of the sentences, which is expected as Lexrank is an eigenvector centrality of sentence-sentence similarity matrix. This suggest SupMMD is able to learn that first few sentences are important in news summarization. Similar result is reported by Kedzie et al. (2018) , where they show that the first few sentences are important in creating summary of news articles.

We present the update summaries (Set A and B) of topic D0906, which contains articles about "Rains and mudslides in Southern California" in Table   7 . We highlight few phrases in bold which could help us to identify the difference between set A and B. Summaries from ICSI and SupMMD methods suggest that set A contains articles describing events from earlier days of the disaster and set B contains articles from later stage of the disaster.

Mcmr: Maximum coverage and minimum redundant text summarization model

NLTK: The Natural Language Toolkit

Comparative document summarisation via classification

Ranking with recursive neural networks and its application to multi-document summarization

Improving the similarity measure of determinantal point processes for extractive multi-document summarization

Multilingual summarization: Dimensionality reduction and a step towards optimal term coverage

A course in functional analysis

Two-stage learning kernel algorithms

BERT: Pre-training of deep bidirectional transformers for language understanding

Topic segmentation algorithms for text summarization and passage retrieval: An exhaustive evaluation

Lexrank: Graph-based lexical centrality as salience in text summarization

Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model

Annual Meeting of the Association for Computational Linguistics

A scalable global model for summarization

The icsi/utd summarization system at tac

Summarizing text documents: Sentence selection and evaluation metrics

A kernel two-sample test

Exploring content models for multi-document summarization

A repository of state of the art and competitive baseline summaries for generic news summarization

Improving the estimation of word importance for news multidocument summarization

Content selection in deep learning models of summarization

Examples are not enough, learn to criticize! criticism for interpretability

Determinantal Point Processes for Machine Learning

Tac 2009 update summarization of icl

Generative moment matching networks

ROUGE: A package for automatic evaluation of summaries

Multi-document summarization via budgeted maximization of submodular functions

A class of submodular functions for document summarization

Learning mixtures of submodular shells with application to document summarization

On the limited memory bfgs method for large scale optimization

Text summarization with pretrained encoders

Dbpedia spotlight: Shedding light on the web of documents

Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning

Summarunner: A recurrent neural network based sequence model for extractive summarization of documents

An analysis of approximations for maximizing submodular set functions-i

The impact of frequency on summarization

Document clustering and text summarization

Weighted distributions and size-biased sampling with applications to wildlife populations and human families

Centroid-based summarization of multiple documents

Kernel choice and classifiability for rkhs embeddings of probability distributions

Hilbert space embeddings and metrics on probability measures

On the influence of the kernel on the consistency of support vector machines

TAC

Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction

Graph-based neural multi-document summarization

Adaptive methods for nonconvex optimization

This work is supported in part by Data to Decisions CRC and ARC Discovery Project DP180101985. This research is also supported by use of the NeCTAR Research Cloud, a collaborative Australian research platform supported by the National Collaborative Research Infrastructure Strategy. We thank Minjeong Shin for helpful feedback and suggestions.