Transactions of the Association for Computational Linguistics, 1 (2013) 89–98. Action Editor: Noah Smith. Submitted 12/2012; Published 5/2013. ©2013 Association for Computational Linguistics.

A Novel Feature-based Bayesian Model for Query Focused Multi-document Summarization

Jiwei Li
School of Computer Science
Carnegie Mellon University
bdlijiwei@gmail.com

Sujian Li
Laboratory of Computational Linguistics
Peking University
lisujian@pku.edu.cn

Abstract

Supervised learning methods and LDA-based topic models have both been applied successfully to multi-document summarization. In this paper, we propose a novel supervised approach that incorporates rich sentence features into a Bayesian topic model in a principled way, thus taking advantage of both topic models and feature-based supervised learning methods. Experimental results on DUC2007, TAC2008 and TAC2009 demonstrate the effectiveness of our approach.

1 Introduction

Query-focused multi-document summarization (Nenkova et al., 2006; Wan et al., 2007; Ouyang et al., 2010) helps users grasp the main ideas of a document collection. In query-focused summarization, a specific topic description, such as a query, expresses the most important topic information, and a summary is generated according to that given topic.

Supervised models have been widely used in summarization (Li et al., 2009; Shen et al., 2007; Ouyang et al., 2010). They usually treat summarization as a classification or regression problem and use various sentence features to build a classifier from labeled positive and negative samples. However, existing supervised approaches seldom exploit the intrinsic structure among sentences, which often gives rise to serious problems such as imbalance and low recall in summaries.

Recently, LDA-based (Blei et al., 2003) Bayesian topic models have been widely applied to multi-document summarization, since Bayesian approaches offer clear and rigorous probabilistic interpretations for summaries (Daume and Marcu, 2006; Haghighi and Vanderwende, 2009; Jin et al., 2010; Mason and Charniak, 2011; Delort and Alfonseca, 2012). Existing Bayesian approaches label sentences or words with topics, and sentences that are closely related to the query or that generalize the documents well are selected into the summary. However, the LDA topic model suffers from the intrinsic disadvantage that it relies only on word frequency for topic modeling and cannot exploit text features such as position or word order (Zhu and Xing, 2010). For example, the first sentence of a document may be more important for the summary, since it is more likely to give a global overview of the document. It is hard for the LDA model to capture such information, so useful signals are lost.

It is therefore natural to try to improve summarization performance by making full use of both text features and the latent semantic structure discovered by the LDA topic model. One related work is from Celikyilmaz and Hakkani-Tur (2010). They built a hierarchical topic model called Hybhsum based on LDA for topic discovery and assumed this model can produce appropriate scores for sentence evaluation. The scores are then used for tuning the weights of various features that are helpful for summary generation. Their work is a good step toward combining topic models with feature-based supervised learning.
However, it is questionable whether a topic model based only on word frequency is good enough to generate appropriate sentence scores for regression. In fact, how to incorporate features into an LDA topic model has been an open problem. Supervised topic models such as sLDA (Blei and McAuliffe, 2007) offer some inspiration: in sLDA, each document is associated with a labeled feature, and sLDA can integrate such a feature into LDA for topic modeling in a principled way.

Following the line of supervised LDA models, in this paper we propose a novel sentence-feature-based Bayesian model, S-sLDA, for multi-document summarization. Our approach naturally combines feature-based supervised methods and topic models. The most important and challenging problem in our model is the tuning of feature weights. To solve this problem, we cast the search for optimum feature weights as an optimization problem and learn these weights in a supervised way. Experiments are conducted on the benchmark data of DUC2007, TAC2008 and TAC2009, and the results show the effectiveness of our model.

The rest of the paper is organized as follows. Section 2 describes background and related work. Section 3 describes the details of the S-sLDA model. Section 4 presents our approach, including learning, inference and summary generation. Section 5 provides experimental results, and Section 6 concludes the paper.

2 Related Work

A variety of approaches have been proposed for query-focused multi-document summarization, including unsupervised (semi-supervised) approaches, supervised approaches, and Bayesian approaches.

Unsupervised (semi-supervised) approaches such as LexRank (Erkan and Radev, 2004) and manifold ranking (Wan et al., 2007) treat summarization as a graph-based ranking problem. The relatedness between the query and each sentence is obtained by propagating the query's influence to each sentence along the graph. Most supervised approaches regard the summarization task as a sentence-level two-class classification problem. Supervised machine learning methods such as Support Vector Machines (SVM) (Li et al., 2009), Maximum Entropy (Osborne, 2002), Conditional Random Fields (Shen et al., 2007) and regression models (Ouyang et al., 2010) have been adopted to leverage rich sentence features for summarization.

Recently, Bayesian topic models have shown their power in summarization thanks to their clear probabilistic interpretation. Daume and Marcu (2006) proposed the BayeSum model for sentence extraction based on the query expansion concept from information retrieval. Haghighi and Vanderwende (2009) proposed TopicSum and HierSum, which use an LDA-like topic model and assign each sentence a distribution over a background topic, document-specific topics and content topics. Celikyilmaz and Hakkani-Tur (2010) made a good step in combining topic models with supervised feature-based regression for sentence scoring in summarization. In their model, the scores of training sentences are first obtained through a novel hierarchical topic model; a feature-based support vector regression (SVR) is then used for sentence score prediction. The problem with Celikyilmaz and Hakkani-Tur's model is that topic modeling and feature-based regression are two separate processes, and the scores of training sentences may be biased because the topic model considers only word frequency and fails to consider other important features.
Supervised feature-based topic models have been proposed in recent years to incorporate different kinds of features into the LDA model. Blei and McAuliffe (2007) proposed sLDA for document–response pairs, and Ramage et al. (2009) proposed Labeled LDA, which defines a one-to-one correspondence between latent topics and user tags. Zhu and Xing (2010) proposed the conditional topic random field (CTRF), which addresses the feature and independence limitations of LDA.

3 Model description

3.1 LDA and sLDA

The hierarchical Bayesian LDA model (Blei et al., 2003) describes a corpus with hidden topics, as shown in Figure 1(a). Let $K$ be the number of topics, $M$ the number of documents in the corpus and $V$ the vocabulary size. The topic distribution $\theta_m$ of each document is drawn from a prior Dirichlet distribution $Dir(\alpha)$, and each document word $w_{mn}$ is sampled from the topic-word distribution $\beta_z$ specified by a draw from the topic-document distribution $\theta_m$. $\beta$ is a $K \times V$ matrix, where each row $\beta_k$ is a distribution over the $V$ terms. The generative procedure of LDA is given in Figure 2. $\theta_m$ is the mixture proportion over topics of document $m$, and $z_{mn}$ is a $K$-dimensional variable indicating the topic assignment of the $n$th word.

Figure 1: Graphical models for (a) the LDA model and (b) the sLDA model.

1. Draw a document proportion vector $\theta_m \mid \alpha \sim Dir(\alpha)$
2. For each word $n$ in document $m$:
   (a) draw topic assignment $z_{mn} \mid \theta_m \sim Multi(\theta_m)$
   (b) draw word $w_{mn} \mid z_{mn}, \beta \sim Multi(\beta_{z_{mn}})$

Figure 2: Generative process for LDA

Supervised LDA (sLDA) (Blei and McAuliffe, 2007) is a document-feature-based model that introduces a response variable for each document for topic discovery, as shown in Figure 1(b). In the generative procedure of sLDA, the document label is drawn from $y \mid \bar{z}_m, \eta, \delta^2 \sim p(y \mid \bar{z}_m, \eta, \delta^2)$, where $\bar{z}_m = \frac{1}{N}\sum_{n=1}^{N} z_{mn}$.

3.2 Problem Formulation

We first give a standard formulation of the task. Let $K$ be the number of topics, $V$ the vocabulary size and $M$ the number of documents. Each document $D_m$ is represented as a collection of sentences $D_m = \{S_s\}_{s=1}^{N_m}$, where $N_m$ denotes the number of sentences in the $m$th document. Each sentence is represented as a collection of words $\{w_{msn}\}_{n=1}^{N_{ms}}$, where $N_{ms}$ denotes the number of words in the sentence. $\vec{Y}_{ms}$ denotes the feature vector of the sentence, and we assume these features are independent.

3.3 S-sLDA

$z_{ms}$ is the hidden variable indicating the topic of the current sentence. In S-sLDA, we assume that all words in the same sentence are generated from the same topic, as proposed by Gruber et al. (2007). $z_{msn}$ denotes the topic assignment of the current word; by our assumption, $z_{msn} = z_{ms}$ for any $n \in [1, N_{ms}]$. The generative process of S-sLDA is shown in Figure 3 and Figure 4. It involves not only the words within the current sentence but also a series of sentence features. The mixture weights over features in S-sLDA are defined with a generalized linear model (GLM):

$$p(\vec{Y}_{ms} \mid z_{ms}, \eta) = \frac{\exp(z_{ms}^T \eta \vec{Y}_{ms})}{\sum_{z_{ms}} \exp(z_{ms}^T \eta \vec{Y}_{ms})} \quad (1)$$

Here we assume that each sentence has $T$ features, so $\vec{Y}_{ms}$ is a $T \times 1$ vector.

Figure 3: Graphical model for the S-sLDA model.

1. Draw a document proportion vector $\theta_m \mid \alpha \sim Dir(\alpha)$
2. For each sentence $s$ in document $m$:
   (a) draw topic assignment $z_{ms} \mid \theta_m \sim Multi(\theta_m)$
   (b) draw feature vector $\vec{Y}_{ms} \mid z_{ms}, \eta \sim p(\vec{Y}_{ms} \mid z_{ms}, \eta)$
   (c) for each word $w_{msn}$ in the sentence, draw $w_{msn} \mid z_{ms}, \beta \sim Multi(\beta_{z_{ms}})$

Figure 4: Generative process for S-sLDA
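To make the feature-generation step concrete, here is a minimal NumPy sketch of the softmax in Eqn. (1). This is our own illustration, not the authors' code; the names eta and y_ms are ours, and a topic assignment $z_{ms}$ is treated as a one-hot indicator, so $z_{ms}^T \eta \vec{Y}_{ms}$ reduces to one row of $\eta \vec{Y}_{ms}$.

```python
import numpy as np

def feature_likelihood(eta, y_ms):
    """p(Y_ms | z_ms = k, eta) for every topic k, following Eqn. (1).

    eta  : (K, T) feature-weight matrix over topics (assumed shape)
    y_ms : (T,)   feature vector of one sentence
    Returns a length-K vector whose k-th entry is the likelihood of the
    observed features when the sentence is assigned to topic k.
    """
    scores = eta @ y_ms        # z^T eta Y for each one-hot z
    scores -= scores.max()     # subtract the max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()   # normalize over the K topic assignments
```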
$\eta$ is a $K \times T$ matrix of feature weights over topics, which largely controls the feature generation procedure. Unlike sLDA, where $\eta$ is a latent variable estimated by maximum likelihood, in S-sLDA the value of $\eta$ is trained through a supervised algorithm, described in detail in Section 4.

3.4 Posterior Inference and Estimation

Given a document and labels for each sentence, the posterior distribution of the latent variables is:

$$p(\theta, \mathbf{z} \mid \mathbf{w}, \vec{Y}, \alpha, \beta_{1:K}, \eta) = \frac{\prod_m p(\theta_m \mid \alpha) \prod_s \big[ p(z_{ms} \mid \theta_m)\, p(\vec{Y}_{ms} \mid z_{ms}, \eta) \prod_n p(w_{msn} \mid \beta_{z_{ms}}) \big]}{\int d\theta\, p(\theta_m \mid \alpha) \sum_{\mathbf{z}} \prod_s \big[ p(z_{ms} \mid \theta_m)\, p(\vec{Y}_{ms} \mid z_{ms}, \eta) \prod_n p(w_{msn} \mid \beta_{z_{ms}}) \big]} \quad (2)$$

Eqn. (2) cannot be computed efficiently. By applying Jensen's inequality, we obtain a lower bound $L$ on the log likelihood of the document:

$$L = \sum_{ms} E[\log p(z_{ms} \mid \theta)] + \sum_{ms} E[\log p(\vec{Y}_{ms} \mid z_{ms}, \eta)] + \sum_m E[\log p(\theta_m \mid \alpha)] + \sum_{msn} E[\log p(w_{msn} \mid z_{ms}, \beta)] + H(q) \quad (3)$$

where $H(q) = -E[\log q]$ is the entropy of the variational distribution $q$, defined as

$$q(\theta, \mathbf{z} \mid \gamma, \phi) = \prod_m q(\theta_m \mid \gamma_m) \prod_s q(z_{ms} \mid \phi_{ms}) \quad (4)$$

Here $\gamma_m$ is a $K$-dimensional Dirichlet parameter vector and the $\phi_{ms}$ are multinomial parameters. The first, third and fourth terms of Eqn. (3) are identical to the corresponding terms for unsupervised LDA (Blei et al., 2003). The second term is the expected log probability of the features given the latent topic assignments:

$$E[\log p(\vec{Y}_{ms} \mid z_{ms}, \eta)] = E(z_{ms})^T \eta \vec{Y}_{ms} - \log \sum_{z_{ms}} \exp(z_{ms}^T \eta \vec{Y}_{ms}) \quad (5)$$

where $E(z_{ms})^T$ is the $1 \times K$ vector $[\phi_{msk}]_{k=1}^{K}$.

Estimation for the S-sLDA model is carried out via a variational EM algorithm. In the EM procedure, the lower bound is first maximized with respect to $\gamma$ and $\phi$, and then with respect to $\alpha$ and $\beta$ while keeping $\gamma$ and $\phi$ fixed.

E-step: The update of the Dirichlet parameter $\gamma$ is identical to that of unsupervised LDA and does not involve the feature vector $\vec{Y}_{ms}$:

$$\gamma_m^{new} \leftarrow \alpha + \sum_{s \in m} \phi_s \quad (6)$$

$$\phi_{sk}^{new} \propto \exp\Big\{ E[\log \theta_m \mid \gamma] + \sum_{n=1}^{N_{ms}} E[\log p(w_{msn} \mid \beta_{1:K})] + \sum_{t=1}^{T} \eta_{kt} Y_{st} \Big\} = \exp\Big[ \Psi(\gamma_{mk}) - \Psi\Big(\sum_{k'=1}^{K} \gamma_{mk'}\Big) + \sum_{n=1}^{N_{ms}} \log \beta_{k,w_{msn}} + \sum_{t=1}^{T} \eta_{kt} Y_{st} \Big] \quad (7)$$

where $\Psi(\cdot)$ denotes the first derivative of the $\log\Gamma$ function, $m_s$ denotes the document that sentence $s$ comes from, and $Y_{st}$ denotes the $t$th feature of sentence $s$.

M-step: The M-step update for $\beta$ is the same as in unsupervised LDA, where the probability of a word being generated from a topic is proportional to the number of times the word is assigned to that topic:

$$\beta_{kw}^{new} = \sum_{m=1}^{M} \sum_{s=1}^{N_m} \sum_{n=1}^{N_{ms}} 1(w_{msn} = w)\, \phi_{ms}^{k} \quad (8)$$

4 Our Approach

4.1 Learning

In this subsection, we describe how we learn the feature weights $\eta$ in a supervised way. The learning of $\eta$ is a supervised algorithm combined with the variational inference of S-sLDA. Given a topic description $Q$¹ and a collection of training sentences $S$ from related documents, human assessors assign a score $v$ ($v = -2, -1, 0, 1, 2$) to each sentence in $S$. The score is an integer between $-2$ (the least desired summary sentences) and $+2$ (the most desired summary sentences), with $0$ denoting a neutral attitude. $O_v = \{o_{v1}, o_{v2}, \ldots\}$ ($v = -2, -1, 0, 1, 2$) is the set of sentences with score $v$.

Let $\phi_{Qk}$ denote the probability that the query is generated from topic $k$. Since the query does not belong to any document, we use the following strategy to estimate $\phi_{Qk}$:

$$\phi_{Qk} = \prod_{w \in Q} \beta_{kw} \cdot \frac{1}{M} \sum_{m=1}^{M} \exp\Big[\Psi(\gamma_{mk}) - \Psi\Big(\sum_{k'=1}^{K} \gamma_{mk'}\Big)\Big] \quad (9)$$

In Eqn. (9), $\prod_{w \in Q} \beta_{kw}$ is the probability that all terms in the query are generated from topic $k$, and $\frac{1}{M}\sum_{m=1}^{M} \exp[\Psi(\gamma_{mk}) - \Psi(\sum_{k'} \gamma_{mk'})]$ can be seen as the average probability that the documents in the corpus are talking about topic $k$.

¹We select multiple queries and their related sentences for training.
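As a concrete illustration, the following is a small sketch of Eqn. (9); the variable names (beta, gamma, query_word_ids) are ours, not the authors'. It uses SciPy's digamma for $\Psi$ and works in log space to avoid underflow in the product over query terms.

```python
import numpy as np
from scipy.special import digamma

def query_topic_distribution(beta, gamma, query_word_ids):
    """phi_Q from Eqn. (9): alignment of the query with each topic.

    beta           : (K, V) topic-word probabilities
    gamma          : (M, K) per-document variational Dirichlet parameters
    query_word_ids : vocabulary indices of the query terms
    """
    # prod_{w in Q} beta_{kw}, computed in log space
    log_query_term = np.log(beta[:, query_word_ids]).sum(axis=1)      # (K,)
    # average over documents of exp[Psi(gamma_mk) - Psi(sum_k' gamma_mk')]
    doc_term = np.exp(digamma(gamma)
                      - digamma(gamma.sum(axis=1, keepdims=True)))    # (M, K)
    phi_q = np.exp(log_query_term) * doc_term.mean(axis=0)
    # normalize so phi_Q can serve as the distribution Q in Eqn. (12);
    # this normalization step is our assumption, not stated in the paper
    return phi_q / phi_q.sum()
```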
Eqn. (9) is based on the assumption that the query topic is relevant to the main topics discussed in the document corpus. This is a reasonable assumption, and most previous LDA-based summarization models rely on similar assumptions.

Next, we define $\phi_{O_v,k}$ for the sentence set $O_v$, which can be interpreted as the probability that the sentences in $O_v$ are generated from topic $k$:

$$\phi_{O_v,k} = \frac{1}{|O_v|} \sum_{s \in O_v} \phi_{sk}, \quad k \in [1, K],\; v \in [-2, 2] \quad (10)$$

where $|O_v|$ denotes the number of sentences in $O_v$. Inspired by the idea that desired summary sentences should be more semantically related to the query, we cast the search for the optimum $\eta$ as the following optimization problem:

$$\min_{\eta} L(\eta) = \sum_{v=-2}^{2} v \cdot KL(O_v \| Q); \qquad \sum_{t=1}^{T} \eta_{kt} = 1 \quad (11)$$

where $KL(O_v \| Q)$ is the Kullback-Leibler divergence between the sentence set $O_v$ and the query topic distribution, as shown in Eqn. (12):

$$KL(O_v \| Q) = \sum_{k=1}^{K} \phi_{O_v k} \log \frac{\phi_{O_v k}}{\phi_{Qk}} \quad (12)$$

In Eqn. (11), $O_2$, which contains the most desirable sentences, is given the largest penalty for its KL divergence from the query; the case is just the opposite for the undesired sets.

Our idea is to incorporate the minimization of Eqn. (11) into the variational inference process of the S-sLDA model. We use gradient-based optimization to minimize Eqn. (11). First, we derive the gradient of $L(\eta)$ with respect to $\eta$:

$$\frac{\partial L(\eta)}{\partial \eta_{xy}} = \sum_{v=-2}^{2} v \cdot \frac{\partial KL(O_v \| Q)}{\partial \eta_{xy}} \quad (13)$$

$$\frac{\partial KL(O_v \| Q)}{\partial \eta_{xy}} = \sum_{k=1}^{K} \frac{1}{|O_v|} \Big(1 + \log \frac{\sum_{s \in O_v} \phi_{sk}}{|O_v|}\Big) \sum_{s \in O_v} \frac{\partial \phi_{sk}}{\partial \eta_{xy}} - \sum_{k=1}^{K} \frac{\log \phi_{Qk}}{|O_v|} \sum_{s \in O_v} \frac{\partial \phi_{sk}}{\partial \eta_{xy}} - \sum_{k=1}^{K} \frac{\frac{1}{|O_v|}\sum_{s \in O_v} \phi_{sk}}{\phi_{Qk}} \frac{\partial \phi_{Qk}}{\partial \eta_{xy}} \quad (14)$$

For simplification, we regard $\beta$ and $\gamma$ as constants during the update of $\eta$, so $\frac{\partial \phi_{Qk}}{\partial \eta_{xy}} = 0$.² For each labeled sentence we can further derive:

$$\frac{\partial \phi_{sk}}{\partial \eta_{xy}} \propto \begin{cases} Y_{sy} \exp\Big[\Psi(\gamma_{m_s k}) - \Psi\Big(\sum_{k'=1}^{K} \gamma_{m_s k'}\Big) + \sum_{t=1}^{T} \eta_{kt} Y_{st}\Big] \cdot \prod_{w \in s} \beta_{kw} & \text{if } k = x \\ 0 & \text{if } k \neq x \end{cases} \quad (15)$$

²This is reasonable because the influence of $\gamma$ and $\beta$ has already been embodied in $\phi$ during each iteration.

4.2 Feature Space

Many features have been proven useful for summarization (Louis et al., 2010). Here we discuss the types of features adopted in the S-sLDA model. Feature values are either binary or normalized to the interval [0,1]. The following features are used:

Cosine similarity with the query: cosine similarity computed over the tf-idf values of terms.

Local inner-document degree order: a binary feature indicating whether the inner-document degree (IDD) of sentence $s$ is the largest among its neighbors, where the IDD is the number of edges between $s$ and other sentences in the same document.

Document-specific word: 1 if a sentence contains a document-specific word, 0 otherwise.

Average unigram probability (Nenkova and Vanderwende, 2005; Celikyilmaz and Hakkani-Tur, 2010): for sentence $s$, $p(s) = \sum_{w \in s} \frac{1}{|s|} p_D(w)$, where $p_D(w)$ is the observed unigram probability of $w$ in the document collection.

In addition, we use the commonly adopted features of sentence position, paragraph position, sentence length and sentence bigram frequency.

E-step:
  initialize $\phi_{sk}^{0} := 1/K$ for all $k$ and $s$.
  initialize $\gamma_{mk} := \alpha_k + N_m/K$ for all $k$.
  initialize $\eta_{kt} := 0$ for all $k$ and $t$.
  while not converged:
    for $m = 1 : M$
      update $\gamma_m^{t+1}$ according to Eqn. (6)
      for $s = 1 : N_m$
        for $k = 1 : K$
          update $\phi_{sk}^{t+1}$ according to Eqn. (7)
        normalize $\phi_{s}^{t+1}$ to sum to 1.
    minimize $L(\eta)$ according to Eqns. (11)-(15).
M-step:
  update $\beta$ according to Eqn. (8).

Figure 5: Learning process of $\eta$ in S-sLDA
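For intuition, here is a minimal sketch of the objective in Eqns. (10)-(12); the container names (phi_by_score, phi_q) are ours. In the full algorithm of Figure 5, this quantity is minimized using the analytic gradient of Eqns. (13)-(15) inside each E-step iteration rather than evaluated in isolation.

```python
import numpy as np

def eta_objective(phi_by_score, phi_q):
    """L(eta) from Eqn. (11): sum over v of v * KL(O_v || Q).

    phi_by_score : dict mapping a score v in {-2,-1,0,1,2} to an
                   (n_v, K) array of phi_s rows for the sentences in O_v
    phi_q        : (K,) query-topic distribution from Eqn. (9)
    """
    total = 0.0
    for v, phis in phi_by_score.items():
        if len(phis) == 0:
            continue                                  # empty score bucket
        phi_ov = phis.mean(axis=0)                    # Eqn. (10)
        kl = np.sum(phi_ov * np.log(phi_ov / phi_q))  # Eqn. (12)
        total += v * kl                               # Eqn. (11)
    return total
```

Minimizing this value pulls the topic distribution of the $v = +2$ sentences toward the query distribution while pushing the negatively scored sets away from it, which is exactly the intuition stated above.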
4.3 Sentence Selection Strategy

Next we explain our sentence selection strategy. Following the intuition that the desired summary should have a small KL divergence from the query, we propose a function to score a set of sentences $Sum$. We use the decreasing logistic function $\zeta(x) = 1/(1 + e^{x})$ to map the score into the range (0,1):

$$Score(Sum) = \zeta(KL(Sum \| Q)) \quad (16)$$

Let $Sum^{\star}$ denote the optimum summary. We obtain $Sum^{\star}$ by maximizing the scoring function:

$$Sum^{\star} = \arg\max_{Sum \subseteq S,\; words(Sum) \le L} Score(Sum) \quad (17)$$

1. Learning: given the labeled sets $O_v$, learn the feature weight matrix $\eta$ using the algorithm in Figure 5.
2. Given a new data set and $\eta$, use the algorithm in Section 3.4 for inference. (The only difference from step 1 is that here we do not need to minimize $L(\eta)$.)
3. Select sentences for the summary using the greedy algorithm described below.

Figure 6: Summary generation by S-sLDA

A greedy algorithm is applied to obtain $Sum^{\star}$ by adding sentences one by one. We use $G$ to denote the set of selected sentences. The algorithm initializes $G$ to $\emptyset$ and the candidate set $X$ to $S$. During each iteration, we select the sentence $s_m$ from $X$ that maximizes $Score(G \cup \{s_m\})$. To avoid topic redundancy in the summary, we also apply a revised MMR strategy (Goldstein et al., 1999; Ouyang et al., 2007) during sentence selection. For each $s_m$, we compute the semantic similarity between $s_m$ and each sentence $s_t$ already in $G$, as in Eqn. (18):

$$cos\text{-}sem(s_m, s_t) = \frac{\sum_k \phi_{s_m k}\, \phi_{s_t k}}{\sqrt{\sum_k \phi_{s_m k}^2}\sqrt{\sum_k \phi_{s_t k}^2}} \quad (18)$$

We require the semantic similarity between any two summary sentences to be less than a threshold $Th_{sem}$, which is set to 0.5 in the experiments. The whole procedure for summarization with the S-sLDA model is illustrated in Figure 6.
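The selection loop can be sketched as follows. This is our own minimal reading of the procedure, with hypothetical names throughout; in particular, we assume the summary's topic distribution for $KL(Sum \| Q)$ is the mean of its sentences' $\phi$ vectors, which the paper does not spell out.

```python
import numpy as np

def cos_sem(phi_a, phi_b):
    """Topic-space cosine similarity between two sentences, Eqn. (18)."""
    return phi_a @ phi_b / (np.linalg.norm(phi_a) * np.linalg.norm(phi_b))

def greedy_select(sentences, phi, phi_q, max_words, th_sem=0.5):
    """Greedy summary construction sketch following Eqns. (16)-(18).

    sentences : list of (text, word_count) candidates
    phi       : (n, K) per-sentence topic distributions
    phi_q     : (K,)   query-topic distribution
    """
    def score(idxs):
        # KL(Sum || Q) on the averaged topic distribution, squashed by
        # the decreasing logistic of Eqn. (16) -- our assumption for "Sum"
        p = phi[idxs].mean(axis=0)
        kl = np.sum(p * np.log(p / phi_q))
        return 1.0 / (1.0 + np.exp(kl))

    selected, words = [], 0
    while True:
        best, best_score = None, -1.0
        for i in range(len(sentences)):
            if i in selected or words + sentences[i][1] > max_words:
                continue
            # MMR-style redundancy filter against already chosen sentences
            if any(cos_sem(phi[i], phi[j]) >= th_sem for j in selected):
                continue
            s = score(selected + [i])
            if s > best_score:
                best, best_score = i, s
        if best is None:
            break                       # nothing fits the length budget
        selected.append(best)
        words += sentences[best][1]
    return [sentences[i][0] for i in selected]
```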
5 Experiments

5.1 Experiment Set-up

The query-focused multi-document summarization task defined in the DUC³ (Document Understanding Conference) and TAC⁴ (Text Analysis Conference) evaluations requires generating a concise and well-organized summary for a collection of related news documents according to a given query that describes the user's information need. The query usually consists of a title and one or more narrative/question sentences. The system-generated summaries for DUC and TAC are limited to 250 words and 100 words respectively. Our experimental data is composed of the DUC 2007, TAC⁵ 2008 and TAC 2009 data sets, which contain 45, 48 and 44 collections respectively. In our experiments, the DUC 2007 data is used for training and the TAC (2008-2009) data for testing.

Stop-words in both documents and queries are removed using a stop-word list of 598 words, and the remaining words are stemmed with the Porter Stemmer⁶. For automatic evaluation of the summaries, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures, including ROUGE-1, ROUGE-2 and ROUGE-SU4⁷ with their corresponding 95% confidence intervals, are used. To obtain a more comprehensive measure of summary quality, we also conduct a manual evaluation on the TAC data, with reference to (Haghighi and Vanderwende, 2009; Celikyilmaz and Hakkani-Tur, 2011; Delort and Alfonseca, 2011).

³http://duc.nist.gov/.
⁴http://www.nist.gov/tac/.
⁵Here, we only use the docset-A data in TAC, since the TAC data consists of docset-A and docset-B data, and docset-B is mainly for the update summarization task.
⁶http://tartarus.org/martin/PorterStemmer/.
⁷Jackknife scoring for ROUGE is used in order to compare with the human summaries.

5.2 Comparison with other Bayesian models

In this subsection, we compare our model with the following Bayesian baselines:

KL-sum: developed by Haghighi and Vanderwende (Lin et al., 2006), using a KL-divergence-based sentence selection strategy:

$$KL(P_s \| Q_d) = \sum_{w} P_s(w) \log \frac{P_s(w)}{Q_d(w)} \quad (19)$$

where $P_s$ is the unigram distribution of the candidate summary and $Q_d$ is the unigram distribution of the document collection. Sentences with higher ranking scores are selected into the summary.

HierSum: an LDA-based approach proposed by Haghighi and Vanderwende (2009), where the unigram distribution is calculated from an LDA-like topic model.

Hybhsum: the supervised approach developed by Celikyilmaz and Hakkani-Tur (2010).

For fair comparison, the baselines use the same preprocessing as our model, and all summaries are truncated to the same length of 100 words.

Methods       ROUGE-1                 ROUGE-2                 ROUGE-SU4
Our approach  0.3724 (0.3660-0.3788)  0.1030 (0.0999-0.1061)  0.1342 (0.1290-0.1394)
Hybhsum       0.3703 (0.3600-0.3806)  0.1007 (0.0952-0.1059)  0.1314 (0.1241-0.1387)
HierSum       0.3613 (0.3374-0.3752)  0.0948 (0.0899-0.0998)  0.1278 (0.1197-0.1359)
KLsum         0.3504 (0.3411-0.3597)  0.0917 (0.0842-0.0992)  0.1234 (0.1155-0.1315)
StandLDA      0.3368 (0.3252-0.3386)  0.0797 (0.0758-0.0836)  0.1156 (0.1072-0.1240)

Table 1: Comparison of Bayesian models on TAC2008

Methods       ROUGE-1                 ROUGE-2                 ROUGE-SU4
Our approach  0.3903 (0.3819-0.3987)  0.1223 (0.1167-0.1279)  0.1488 (0.1446-0.1530)
Hybhsum       0.3824 (0.3686-0.3952)  0.1173 (0.1132-0.1214)  0.1436 (0.1358-0.1514)
HierSum       0.3706 (0.3624-0.3788)  0.1088 (0.0950-0.1144)  0.1386 (0.1312-0.1464)
KLsum         0.3619 (0.3510-0.3728)  0.0972 (0.0917-0.1047)  0.1299 (0.1213-0.1385)
StandLDA      0.3552 (0.3447-0.3657)  0.0847 (0.0813-0.0881)  0.1214 (0.1141-0.1286)

Table 2: Comparison of Bayesian models on TAC2009

From Table 1 and Table 2 we can see that, among the Bayesian baselines, Hybhsum achieves the best result, which further illustrates the advantage of combining topic models with supervised methods. Table 1 shows that our S-sLDA model performs better than Hybhsum, with improvements of 3.4% and 3.7% in ROUGE-2 and ROUGE-SU4 respectively on the TAC2008 data. The comparison extends to the TAC2009 data in Table 2: S-sLDA outperforms Hybhsum by 4.3% in ROUGE-2 and 5.1% in ROUGE-SU4. These improvements are significant considering that in the TAC2008 evaluation the performances of the top-ranking systems are very close: the best system is only 4.2% above the 4th best on ROUGE-2 and 1.2% on ROUGE-SU4.

5.3 Comparison with other baselines

In this subsection, we compare our model with several widely used summarization models:

Manifold: the one-layer graph-based semi-supervised summarization approach developed by Wan et al. (2008). The graph is constructed using only tf-idf-based sentence relations and neglects topic information.

LexRank: the graph-based summarization approach of Erkan and Radev (2004), a revised version of the well-known PageRank web-ranking algorithm; unlike Manifold, it is unsupervised.

SVM: a standard supervised method, Support Vector Machines (Vapnik, 1995), using the same features as our approach.

MEAD: the centroid-based summarization algorithm of Radev et al. (2004). Cluster centroids in MEAD consist of words that are central not only to one article in a cluster but to all the articles; similarity is measured using tf-idf.
We also report, for comparison, the top three participating systems with regard to ROUGE-2 on TAC2008 and TAC2009, denoted SysRank 1st, 2nd and 3rd (Gillick et al., 2008; Zhang et al., 2008; Gillick et al., 2009; Varma et al., 2009). The ROUGE scores of the top TAC systems are those provided directly by the TAC evaluation.

Methods       ROUGE-1                 ROUGE-2                 ROUGE-SU4
Our approach  0.3724 (0.3660-0.3788)  0.1030 (0.0999-0.1061)  0.1342 (0.1290-0.1394)
SysRank 1st   0.3742 (0.3639-0.3845)  0.1039 (0.0974-0.1104)  0.1364 (0.1285-0.1443)
SysRank 2nd   0.3717 (0.3610-0.3824)  0.0990 (0.0944-0.1038)  0.1326 (0.1269-0.1385)
SysRank 3rd   0.3710 (0.3550-0.3849)  0.0977 (0.0920-0.1034)  0.1329 (0.1267-0.1391)
PageRank      0.3597 (0.3499-0.3695)  0.0879 (0.0809-0.0950)  0.1221 (0.1173-0.1269)
Manifold      0.3621 (0.3506-0.3736)  0.0931 (0.0868-0.0994)  0.1243 (0.1206-0.1280)
SVM           0.3588 (0.3489-0.3687)  0.0921 (0.0882-0.0960)  0.1258 (0.1204-0.1302)
MEAD          0.3558 (0.3489-0.3627)  0.0917 (0.0882-0.0952)  0.1226 (0.1174-0.1278)

Table 3: Comparison with baselines on TAC2008

Methods       ROUGE-1                 ROUGE-2                 ROUGE-SU4
Our approach  0.3903 (0.3819-0.3987)  0.1223 (0.1167-0.1279)  0.1488 (0.1446-0.1530)
SysRank 1st   0.3917 (0.3778-0.4057)  0.1218 (0.1122-0.1314)  0.1505 (0.1414-0.1596)
SysRank 2nd   0.3914 (0.3808-0.4020)  0.1212 (0.1147-0.1277)  0.1513 (0.1455-0.1571)
SysRank 3rd   0.3851 (0.3762-0.3932)  0.1084 (0.1025-0.1144)  0.1447 (0.1398-0.1496)
PageRank      0.3616 (0.3532-0.3700)  0.0849 (0.0802-0.0896)  0.1249 (0.1221-0.1277)
Manifold      0.3713 (0.3586-0.3841)  0.1014 (0.0950-0.1178)  0.1342 (0.1299-0.1385)
SVM           0.3649 (0.3536-0.3762)  0.1028 (0.0957-0.1099)  0.1319 (0.1258-0.1380)
MEAD          0.3601 (0.3536-0.3666)  0.1001 (0.0953-0.1049)  0.1287 (0.1228-0.1346)

Table 4: Comparison with baselines on TAC2009

From Table 3 and Table 4 we can see that our approach consistently outperforms the baselines on the ROUGE metrics. Compared with the standard supervised method SVM, the relative improvements over the ROUGE-1, ROUGE-2 and ROUGE-SU4 scores are 4.3%, 13.1% and 8.3% respectively on TAC2008, and 7.2%, 14.9% and 14.3% on TAC2009. Our model does not beat the top participating systems on TAC2008 and TAC2009, but considering that it neither uses a sentence compression algorithm nor leverages domain knowledge bases such as Wikipedia or additional training data, such small differences in ROUGE scores are reasonable.

5.4 Manual Evaluation

In order to obtain a more accurate measure of summary quality for our S-sLDA model and Hybhsum, we performed a simple user study covering the following aspects: (1) Overall quality: which summary is better overall? (2) Focus: which summary contains less irrelevant content? (3) Responsiveness: which summary is more responsive to the query? (4) Non-redundancy: which summary is less redundant? Eight judges who specialize in NLP participated in the blind evaluation. Evaluators were presented with two summaries generated by S-sLDA and Hybhsum, along with the four questions above, and had to answer which summary was better (or a tie). We randomly selected 20 document collections from the TAC 2008 data and randomly assigned the two summaries for each collection to three different evaluators to judge which model is better in each aspect.

As shown in Table 5, the two models are nearly tied with respect to non-redundancy, mainly because both use appropriate MMR strategies.
                Our (win)   Hybhsum (win)   Tie
Overall            37            14          9
Focus              32            18         10
Responsiveness     33            13         14
Non-redundancy     13            11         36

Table 5: Manual evaluation of S-sLDA versus Hybhsum on 20 TAC2008 collections

But for overall quality, focus and responsiveness, S-sLDA outperforms Hybhsum (t-test at the 95% confidence level). Table 6 shows the example summaries generated by the two models for document collection D0803A-A in TAC2008, whose query is "Describe the coal mine accidents in China and actions taken". From Table 6 we can see that each sentence in the two summaries is related to the topic of coal mines in China. We also observe that the summary in Table 6(a) is better than that in Table 6(b), tending to select shorter sentences and to provide more information. This is because in the S-sLDA model topic modeling is determined simultaneously by terms and other features such as sentence length and sentence position, which contributes to summary quality. In Table 6(b), sentences (3) and (5) provide unimportant information of the "somebody said" type, though they contain words related to the coal-mine topics.

(a) (1) China to close at least 4,000 coal mines this year: official. (2) By Oct. 10 this year there had been 43 coal mine accidents that killed 10 or more people. (3) Officials had stakes in coal mines. (4) All the coal mines will be closed down this year. (5) In the first eight months, the death toll of coal mine accidents rose 8.5 percent last year. (6) The government has issued a series of regulations and measures to improve the country's coal mine safety situation. (7) The mining safety technology and equipments have been sold to countries. (8) More than 6,000 miners died in accidents in China.

(b) (1) In the first eight months, the death toll of coal mine accidents across China rose 8.5 percent from the same period last year. (2) China will close down a number of ill-operated coal mines at the end of this month, said a work safety official here Monday. (3) Li Yizhong, director of the National Bureau of Production Safety Supervision and Administration, has said the collusion between mine owners and officials is to be condemned. (4) From January to September this year, 4,228 people were killed in 2,337 coal mine accidents. (5) Chen said officials who refused to register their stakes in coal mines within the required time

Table 6: Example summary text generated by (a) S-sLDA and (b) Hybhsum (D0803A-A, TAC2008)

6 Conclusion

In this paper, we propose a novel supervised approach based on a revised supervised topic model for query-focused multi-document summarization. Our approach naturally combines a Bayesian topic model with supervised methods and enjoys the advantages of both. Experiments on benchmark data sets demonstrate the good performance of our model.

Acknowledgments

This research work has been supported by NSFC grants (No. 90920011 and No. 61273278), the National Key Technology R&D Program (No. 2011BAH1B0403), and the National High Technology R&D Program (No. 2012AA011101). We also thank the three anonymous reviewers for their helpful comments. Corresponding author: Sujian Li.

References

David Blei and Jon McAuliffe. 2007. Supervised topic models. In Neural Information Processing Systems.

David Blei, Andrew Ng and Michael Jordan. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022.

Charles Broyden. 1965. A class of methods for solving nonlinear simultaneous equations. Mathematics of Computation, 19:577-593.
Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Asli Celikyilmaz and Dilek Hakkani-Tur. 2010. A hybrid hierarchical model for multi-document summarization. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 815-825.

Jade Goldstein, Mark Kantrowitz, Vibhu Mittal and Jaime Carbonell. 1999. Summarizing text documents: sentence selection and evaluation metrics. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 121-128.

Amit Gruber, Michal Rosen-Zvi and Yair Weiss. 2007. Hidden topic Markov models. In Artificial Intelligence and Statistics.

Hal Daume III and Daniel Marcu. 2006. Bayesian query-focused summarization. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 305-312.

Gunes Erkan and Dragomir Radev. 2004. LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research (JAIR), 22:457-479.

Dan Gillick, Benoit Favre and Dilek Hakkani-Tur. 2008. The ICSI summarization system at TAC. In Proceedings of TAC 2008.

Dan Gillick, Benoit Favre, Dilek Hakkani-Tur, Berndt Bohnet, Yang Liu and Shasha Xie. 2009. The ICSI/UTD summarization system at TAC 2009. In Proceedings of TAC 2009.

Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 362-370.

Feng Jin, Minlie Huang and Xiaoyan Zhu. 2010. The summarization systems at TAC 2010. In Proceedings of the Third Text Analysis Conference (TAC 2010).

Liangda Li, Ke Zhou, Gui-Rong Xue, Hongyuan Zha and Yong Yu. 2009. Enhancing diversity, coverage and balance for summarization through structure learning. In Proceedings of the 18th International Conference on World Wide Web, pages 71-80.

Chin-Yew Lin, Guihong Cao, Jianfeng Gao and Jian-Yun Nie. 2006. An information-theoretic approach to automatic evaluation of summaries. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 462-470.

Annie Louis, Aravind Joshi and Ani Nenkova. 2010. Discourse indicators for content selection in summarization. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 147-156.

Tengfei Ma and Xiaojun Wan. 2010. Multi-document summarization using minimum distortion. In Proceedings of the International Conference on Data Mining, pages 354-363.

Rebecca Mason and Eugene Charniak. 2011. Extractive multi-document summaries should explicitly not contain document-specific content. In Proceedings of ACL-HLT, pages 49-54.

Ani Nenkova and Lucy Vanderwende. 2005. The impact of frequency on summarization. Tech. Report MSR-TR-2005-101, Microsoft Research, Redmond, Washington.

Ani Nenkova, Lucy Vanderwende and Kathleen McKeown. 2006. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 573-580.
Miles Osborne. 2002. Using maximum entropy for sentence extraction. In Proceedings of the ACL-02 Workshop on Automatic Summarization, pages 1-8.

Jahna Otterbacher, Gunes Erkan and Dragomir Radev. 2005. Using random walks for question-focused sentence retrieval. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 915-922.

You Ouyang, Wenjie Li, Sujian Li and Qin Lu. 2011. Applying regression models to query-focused multi-document summarization. Information Processing and Management, pages 227-237.

You Ouyang, Sujian Li and Wenjie Li. 2007. Developing learning strategies for topic-based summarization. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 79-86.

Daniel Ramage, David Hall, Ramesh Nallapati and Christopher Manning. 2009. Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Vol. 1, pages 248-256.

Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang and Zheng Chen. 2007. Document summarization using conditional random fields. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 2862-2867.

V. Varma, V. Bharat, S. Kovelamudi, P. Bysani, S. GSK, K. Kumar N, K. Reddy and N. Maganti. 2009. IIIT Hyderabad at TAC 2009. In Proceedings of TAC 2009.

Xiaojun Wan and Jianwu Yang. 2008. Multi-document summarization using cluster-based link analysis. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 299-306.

Xiaojun Wan, Jianwu Yang and Jianguo Xiao. 2007. Manifold-ranking based topic-focused multi-document summarization. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 2903-2908.

Furu Wei, Wenjie Li, Qin Lu and Yanxiang He. 2008. Exploiting query-sensitive similarity for graph-based query-oriented summarization. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 283-290.

Jin Zhang, Xueqi Cheng, Hongbo Xu, Xiaolei Wang and Yiling Zeng. 2008. ICTCAS's ICTGrasper at TAC 2008: summarizing dynamic information with signature terms based content filtering. In Proceedings of TAC 2008.

Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet and Bernhard Schölkopf. 2003. Ranking on data manifolds. In Proceedings of the Conference on Advances in Neural Information Processing Systems, pages 169-176.

Jun Zhu and Eric Xing. 2010. Conditional topic random fields. In Proceedings of the 27th International Conference on Machine Learning.

Xiaojin Zhu, Zoubin Ghahramani and John Lafferty. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the International Conference on Machine Learning, pages 912-919.