Cross-lingual Projected Expectation Regularization for Weakly Supervised Learning

Mengqiu Wang and Christopher D. Manning
Computer Science Department, Stanford University
Stanford, CA 94305 USA
{mengqiu,manning}@cs.stanford.edu

Abstract

We consider a multilingual weakly supervised learning scenario where knowledge from annotated corpora in a resource-rich language is transferred via bitext to guide the learning in other languages. Past approaches project labels across bitext and use them as features or gold labels for training. We propose a new method that projects model expectations rather than labels, which facilitates the transfer of model uncertainty across language boundaries. We encode expectations as constraints and train a discriminative CRF model using Generalized Expectation Criteria (Mann and McCallum, 2010). Evaluated on standard Chinese-English and German-English NER datasets, our method demonstrates F1 scores of 64% and 60% when no labeled data is used. Attaining the same accuracy with supervised CRFs requires 12k and 1.5k labeled sentences. Furthermore, when combined with labeled examples, our method yields significant improvements over state-of-the-art supervised methods, achieving the best reported numbers to date on the Chinese OntoNotes and German CoNLL-03 datasets.

1 Introduction

Supervised statistical learning methods have enjoyed great popularity in Natural Language Processing (NLP) over the past decade. The success of supervised methods depends heavily upon the availability of large amounts of annotated training data. Manual curation of annotated corpora is a costly and time-consuming process. To date, most annotated resources reside within the English language, which hinders the adoption of supervised learning methods in many multilingual environments.

To minimize the need for annotation, significant progress has been made in developing unsupervised and semi-supervised approaches to NLP (Collins and Singer 1999; Klein 2005; Liang 2005; Smith 2006; Goldberg 2010; inter alia). More recent paradigms for semi-supervised learning allow modelers to directly encode knowledge about the task and the domain as constraints to guide learning (Chang et al., 2007; Mann and McCallum, 2010; Ganchev et al., 2010). However, in a multilingual setting, coming up with effective constraints requires extensive knowledge of the foreign1 language.

1 For experimental purposes, we designate English as the resource-rich language, and other languages of interest as "foreign". In our experiments, we simulate the resource-poor scenario using Chinese and German, even though in reality these two languages are quite rich in resources.

Bilingual parallel text (bitext) lends itself as a medium to transfer knowledge from a resource-rich language to foreign languages. Yarowsky and Ngai (2001) project labels produced by an English tagger to the foreign side of bitext, then use the projected labels to learn an HMM model. More recent work applied the projection-based approach to more language pairs, and further improved performance through the use of type-level constraints from tag dictionaries and feature-rich generative or discriminative models (Das and Petrov, 2011; Täckström et al., 2013).

In our work, we propose a new projection-based method that differs in two important ways. First, we never explicitly project the labels. Instead, we project expectations over the labels. This projection
acts as a soft constraint over the labels, which allows us to transfer more information and uncertainty across language boundaries. Secondly, we encode the expectations as constraints and train a model by minimizing divergence between model expectations and projected expectations in a Generalized Expectation (GE) Criteria (Mann and McCallum, 2010) framework.

We evaluate our approach on Named Entity Recognition (NER) tasks for English-Chinese and English-German language pairs on standard public datasets. We report results in two settings: a weakly supervised setting where no labeled data or a small amount of labeled data is available, and a semi-supervised setting where labeled data is available, but we can gain predictive power by learning from unlabeled bitext.

2 Related Work

Most semi-supervised learning approaches embody the principle of learning from constraints. There are two broad categories of constraints: multi-view constraints, and external knowledge constraints.

Examples of methods that explore multi-view constraints include self-training (Yarowsky, 1995; McClosky et al., 2006),2 co-training (Blum and Mitchell, 1998; Sindhwani et al., 2005), multi-view learning (Ando and Zhang, 2005; Carlson et al., 2010), and discriminative and generative model combination (Suzuki and Isozaki, 2008; Druck and McCallum, 2010).

2 A multi-view interpretation of self-training is that the self-tagged additional data offers new views to learners trained on existing labeled data.

An early example of using knowledge as constraints in weakly supervised learning is the work by Collins and Singer (1999). They showed that the addition of a small set of "seed" rules greatly improves a co-training style unsupervised tagger. Chang et al. (2007) proposed a constraint-driven learning (CODL) framework where constraints are used to guide the selection of the best self-labeled examples to be included as additional training data in an iterative EM-style procedure. The kinds of constraints used in applications such as NER are ones like "the words CA, Australia, NY are LOCATION" (Chang et al., 2007). Notice the similarity of this particular constraint to the kinds of features one would expect to see in a discriminative MaxEnt model. The difference is that instead of learning the validity (or weight) of this feature from labeled examples — since we do not have them — we can constrain the model using our knowledge of the domain. Druck et al. (2009) also demonstrated that in an active learning setting where the annotation budget is limited, it is more efficient to label features than examples. Other sources of knowledge include lexicons and gazetteers (Druck et al., 2007; Chang et al., 2007).

While it is straightforward to see how resources such as a list of city names can give a lot of mileage in recognizing locations, we are also exposed to the danger of over-committing to hard constraints. For example, it becomes problematic with city names that are ambiguous, such as Augusta, Georgia.3 To soften these constraints, Mann and McCallum (2010) proposed the Generalized Expectation (GE) Criteria framework, which encodes constraints as a regularization term over some score function that measures the divergence between the model's expectation and the target expectation.
The connection between GE and CODL is analogous to the relationship between hard (Viterbi) EM and soft EM, as illustrated by Samdani et al. (2012).

3 This is a city in the state of Georgia in the USA, famous for its golf courses. It is ambiguous since both Augusta and Georgia can also be used as person names.

Another closely related work is the Posterior Regularization (PR) framework by Ganchev et al. (2010). In fact, as Bellare et al. (2009) have shown, in a discriminative model these two methods optimize exactly the same objective.4 The two differ in optimization details: PR uses an EM algorithm to approximate the gradients, which avoids the expensive computation of a covariance matrix between features and constraints, whereas GE directly calculates the gradient. However, later results (Druck, 2011) have shown that using the Expectation Semiring techniques of Li and Eisner (2009), one can compute the exact gradients of GE in a Conditional Random Field (CRF) (Lafferty et al., 2001) at costs no greater than computing the gradients of an ordinary CRF. And empirically, GE tends to perform more accurately than PR (Bellare et al., 2009; Druck, 2011).

4 The different terminology employed by GE and PR may be confusing to discerning readers, but the "expectation" in the context of GE means the same thing as "marginal posterior" in PR.

Obtaining appropriate knowledge resources for constructing constraints remains a bottleneck in applying GE and PR to new languages. However, a number of past works recognize parallel bitext as a rich source of linguistic constraints, naturally captured in the translations. As a result, bitext has been effectively utilized for unsupervised multilingual grammar induction (Alshawi et al., 2000; Snyder et al., 2009), parsing (Burkett and Klein, 2008), and sequence labeling (Naseem et al., 2009).

A number of recent works have also explored bilingual constraints in the context of simultaneous bilingual tagging, and showed that enforcing agreement between language pairs gives superior results to monolingual tagging (Burkett et al., 2010; Che et al., 2013; Wang et al., 2013a). Burkett et al. (2010) also demonstrated an uptraining (Petrov et al., 2010) setting where tag-induced bitext can be used as additional monolingual training data to improve monolingual taggers. A major drawback of this approach is that it requires readily trained tagging models in each language, which makes a weakly supervised setting infeasible. Another intricacy of this approach is that it only works when the two models have comparable strength, since mutual agreements are enforced between them.

Projection-based methods can be very effective in weakly supervised scenarios, as demonstrated by Yarowsky and Ngai (2001), and Xi and Hwa (2005). One problem with projected labels is that they are often too noisy to be directly used as training signals. To mitigate this problem, Das and Petrov (2011) designed a label propagation method to automatically induce a tag lexicon for the foreign language to smooth the projected labels. Fossum and Abney (2005) filter out projection noise by combining projections from multiple source languages. However, this approach is not always viable since it relies on having parallel bitext from multiple source languages. Li et al. (2012) proposed the use of the crowd-sourced Wiktionary as an additional resource for inducing tag lexicons. More recently, Täckström et al.
(2013) combined token-level and type-level constraints to constrain legitimate label sequences and recalibrate the probability distribution in a CRF. The tag dictionaries used for POS tagging are analogous to the gazetteers and name lexicons used for NER by Chang et al. (2007).

Our work is also closely related to Ganchev et al. (2009). They used a two-step projection method similar to Das and Petrov (2011) for dependency parsing. Instead of using the projected linguistic structures as ground truth (Yarowsky and Ngai, 2001), or as features in a generative model (Das and Petrov, 2011), they used them as constraints in a PR framework. Our work differs by projecting expectations rather than Viterbi one-best labels. We also choose the GE framework over PR. Experiments in Bellare et al. (2009) and Druck (2011) suggest that in a discriminative model (like ours), GE is more accurate than PR. More recently, Ganchev and Das (2013) further extended this line of work to directly train discriminative sequence models using cross-lingual projection with PR. The types of constraints applied in this new work are similar to the ones in the monolingual PR setting proposed by Ganchev et al. (2010), where the total counts of labels of a particular kind are expected to match some fraction of the projected total counts. Our work differs in that we enforce expectation constraints at the token level, which gives tighter guidance to learning the model.

3 Approach

Given bitext between English and a foreign language, our goal is to learn a CRF model in the foreign language from little or no labeled data. Our method performs Cross-Lingual Projected Expectation Regularization (CLiPER).

For every aligned sentence pair in the bitext, we first compute the posterior marginal at each word position on the English side using a pre-trained English CRF tagger; then for each aligned English word, we project its posterior marginal as expectations to the aligned word position on the foreign side. Figure 1 shows a snippet of a sentence from a real corpus.

[Figure 1 appears here: an aligned English-Chinese sentence pair ("... a reception in Luobu Linka ... met with representatives of Zhongguo Ribao" / "... 在 罗布林卡 举行 的 招待会 ... 会见 了 中国 日报 代表"), with the English CRF posterior over labels shown above each English word.] Figure 1: Diagram illustrating the projection of model expectation from English to Chinese. The posterior probabilities assigned by the English CRF model are shown above each English word; automatically induced word alignments are shown in red; the correct projected labels for Chinese words are shown in green, and incorrect labels are shown in red.

Notice that if we were to directly project the Viterbi best assignment from English to Chinese, all three Chinese words that are named entities would have gotten the wrong tags. But projecting the English CRF model expectations preserves some uncertainties, informing the Chinese model that there is a 40%
chance that "中国日报" (China Daily) is an organization in this context.

We would like to learn a CRF model in the foreign language that has similar expectations as the projected expectations from English. To this end, we adopt the Generalized Expectation (GE) Criteria framework introduced by Mann and McCallum (2010). In the remainder of this section, we follow the notation used in (Druck, 2011) to explain our approach.

3.1 CLiPER

The general idea of GE is that we can express our preferences over models through constraint functions. A desired model should satisfy the imposed constraints by matching the expectations on these constraint functions with some target expectations (attained from external knowledge like lexicons or, in our case, transferred knowledge from English).

We define a constraint function φ_{i,l_j} for each word position i and output label assignment l_j. φ_{i,l_j} = 0 is a constraint that position i cannot take label l_j. The set {l_1, ..., l_m} denotes all possible label assignments for each y_i, and m is the number of label values. A_i is the set of English words aligned to Chinese word i. The φ_{i,l_j} are defined for all positions i such that A_i ≠ ∅. In other words, the constraint function applies only to Chinese word positions that have at least one aligned English word. Each φ_{i,l_j}(y) can be treated as a Bernoulli random variable, and we concatenate the set of all φ_{i,l_j} into a random vector φ(y), where φ_k = φ_{i,l_j} if k = i*m + j. We drop the (y) in φ for simplicity.

The target expectation over φ_{i,l_j}, denoted as φ̃_{i,l_j}, is the expectation of assigning label l_j to the English word(s) in A_i under the English conditional probability model. When multiple English words are aligned to the same foreign word, we average the expectations.

The expectation over φ under a conditional probability model P(y|x; θ) is denoted as E_{P(y|x;θ)}[φ], and simplified as E_θ[φ] whenever it is unambiguous. The conditional probability model P(y|x; θ) in our case is defined as a standard linear-chain CRF:5

P(y|x;\theta) = \frac{1}{Z(x;\theta)} \exp\left( \sum_{i}^{n} \theta \cdot f(x, y_i, y_{i-1}) \right)

where f is a set of feature functions, θ are the matching parameters to learn, and n = |x|.

5 We simplify notation by dropping the L2 regularizer in the CRF definition, but apply it in our experiments.

The objective function to maximize in a standard CRF is the log probability over a collection of labeled documents:

L_{CRF}(\theta) = \sum_{a=1}^{a'} \log P(y^*_a \mid x_a; \theta) \qquad (1)

where a' is the number of labeled sentences and y* is an observed label sequence.

The objective function to maximize in GE is defined as the sum over all unlabeled examples on the foreign side of the bitext, denoted as x_b, over some cost function S between the model expectation over φ (E_θ[φ]) and the target expectation (φ̃). We choose S to be the negative L2^2 squared error sum,6 defined as:

L_{GE}(\theta) = \sum_{b=1}^{n'} S\left( E_{P(y_b \mid x_b;\theta)}[\phi(y_b)],\; \tilde{\phi}_b \right) = \sum_{b=1}^{n'} -\left\| \tilde{\phi}_b - E_\theta[\phi(y_b)] \right\|_2^2 \qquad (2)

where n' is the total number of unlabeled bitext sentence pairs.

6 In general, other loss functions such as KL-divergence can also be used for S. We found L2^2 to work well in practice.

When both labeled and bitext training data are available, the joint objective is the sum of Eqn. 1 and 2. Each is computed over the labeled training data and the foreign half of the bitext, respectively.
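To make the construction of the projected targets concrete, the following is a minimal sketch, assuming the English CRF exposes per-token posterior marginals as an array and that word alignments are given as index pairs; the function and variable names are our own illustration rather than the released implementation.

import numpy as np

def project_expectations(en_posteriors, alignments, len_fr):
    """Map English posterior marginals to projected target expectations.

    en_posteriors: (len_en, m) array of P(label | position) from the English CRF.
    alignments: iterable of (en_index, fr_index) word-alignment pairs.
    len_fr: number of words in the foreign sentence.
    Returns {foreign position i: length-m target distribution}, i.e. the
    tilde-phi constraints; positions with empty A_i receive no constraint.
    """
    m = en_posteriors.shape[1]
    sums = np.zeros((len_fr, m))
    counts = np.zeros(len_fr)
    for en_i, fr_i in alignments:
        sums[fr_i] += en_posteriors[en_i]   # accumulate aligned English expectations
        counts[fr_i] += 1
    return {i: sums[i] / counts[i]          # average when several English words align
            for i in range(len_fr) if counts[i] > 0}

def ge_penalty(fr_marginals, targets):
    """Negative squared L2 error of Eqn. 2 for one foreign sentence, given the
    foreign model's current marginals (len_fr, m) and the projected targets."""
    loss = 0.0
    for i, tgt in targets.items():
        diff = tgt - fr_marginals[i]
        loss -= float(diff @ diff)
    return loss

For the sentence pair in Figure 1, the target at the position of 中国 would place roughly 0.59 on PER and 0.41 on ORG, which is exactly the soft signal discussed above.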
We can optimize this joint objective by computing the gradients and using a gradient-based optimization method such as L-BFGS. The gradient of L_CRF decomposes into the gradients over each labeled training example (x, y*). Computing the gradient of L_GE decomposes into the gradients of S(E_{P(y|x_b;θ)}[φ]) for each unlabeled foreign sentence x and the constraints φ over this example. The gradients can be calculated as:

\frac{\partial}{\partial\theta} S(E_\theta[\phi]) = -\frac{\partial}{\partial\theta} \left( \tilde{\phi} - E_\theta[\phi] \right)^{T} \left( \tilde{\phi} - E_\theta[\phi] \right) = 2 \left( \tilde{\phi} - E_\theta[\phi] \right)^{T} \left( \frac{\partial}{\partial\theta} E_\theta[\phi] \right)

We denote the penalty vector 2(φ̃ − E_θ[φ]) by u. The term ∂E_θ[φ]/∂θ is a matrix where each column contains the gradients for a particular model feature θ with respect to all constraint functions φ. It can be computed as:

\begin{aligned}
\frac{\partial}{\partial\theta} E_\theta[\phi]
 &= \sum_y \phi(y) \frac{\partial}{\partial\theta} P(y \mid x;\theta) \\
 &= \sum_y \phi(y) \frac{\partial}{\partial\theta} \left( \frac{1}{Z(x;\theta)} \exp(\theta^T f(x,y)) \right) \\
 &= \sum_y \phi(y) \left( \frac{1}{Z(x;\theta)} \left( \frac{\partial}{\partial\theta} \exp(\theta^T f(x,y)) \right) + \exp(\theta^T f(x,y)) \left( \frac{\partial}{\partial\theta} \frac{1}{Z(x;\theta)} \right) \right) \\
 &= \sum_y \phi(y) \left( P(y \mid x;\theta) f(x,y)^T - P(y \mid x;\theta) \sum_{y'} P(y' \mid x;\theta) f(x,y')^T \right) \\
 &= \sum_y P(y \mid x;\theta)\, \phi(y) f(x,y)^T - \left( \sum_y P(y \mid x;\theta) \phi(y) \right) \left( \sum_y P(y \mid x;\theta) f(x,y)^T \right) \\
 &= \mathrm{COV}_{P(y \mid x;\theta)}\left( \phi(y), f(x,y) \right) \qquad (3) \\
 &= E_\theta[\phi f^T] - E_\theta[\phi]\, E_\theta[f^T] \qquad (4)
\end{aligned}

Eqn. 3 gives the intuition of how optimization works in GE. In each iteration of L-BFGS, the model parameters are updated according to their covariance with the constraint features, scaled by the difference between the current expectation and the target expectation. The term E_θ[φ f^T] in Eqn. 4 can be computed using a dynamic programming (DP) algorithm, but solving it directly requires us to store a matrix of the same dimension as f^T in each step of the DP. We can reduce the complexity by using the same trick as in (Li and Eisner, 2009) for computing the Expectation Semiring. The resulting algorithm has complexity O(nm^2), which is the same as the standard forward-backward inference algorithm for a CRF. (Druck, 2011, 93) gives full details of this derivation.

3.2 Hard vs. soft Projection

Projecting expectations instead of one-best label assignments from English to the foreign language can be thought of as a soft version of the method described in (Das and Petrov, 2011) and (Ganchev et al., 2009). Soft projection has its advantage: when the English model is not certain about its predictions, we do not have to commit to the current best prediction. The foreign model has more freedom to form its own belief, since any marginal distribution it produces would deviate from a flat distribution by just about the same amount. In general, preserving uncertainties until later is a strategy that has benefited many NLP tasks (Finkel et al., 2006). Hard projection can also be treated as a special case in our framework. We can simply recalibrate the posterior marginal of English by assigning probability mass 1 to the most likely outcome and zeroing everything else out, effectively taking the argmax of the marginal at each word position. We refer to this version of the expectation as the "hard" expectation. In the hard projection setting, GE training resembles a "project-then-train" style semi-supervised CRF training scheme (Yarowsky and Ngai, 2001; Täckström et al., 2013). In such a training scheme, we project the one-best predictions of the English CRF to the foreign side through word alignments, then include the newly "tagged" foreign data as additional training data for a standard CRF in the foreign language. Rather than projecting labels on a per-word basis, Yarowsky and Ngai (2001) also explored an alternative method for the noun-phrase (NP) bracketing task that amounts to projecting the spans of NPs, based on the observation that individual NPs tend to retain their sequential spans across translations. We experimented with the same method for NER, but found that this method of projecting the NE spans does not help in reducing noise and actually lowers model performance.
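The "hard" expectation described above is simply the argmax of the projected marginal at each position. The short sketch below (our own illustration, with the label order and variable names assumed) makes the contrast with soft projection explicit, using the approximate posteriors from Figure 1.

import numpy as np

LABELS = ["O", "PER", "LOC", "ORG", "GPE"]

def harden(soft_target):
    """Assign probability mass 1 to the most likely label and zero out the rest,
    i.e. take the argmax of the projected marginal at this word position."""
    hard = np.zeros_like(soft_target)
    hard[np.argmax(soft_target)] = 1.0
    return hard

# Approximate posteriors projected onto "中国" in Figure 1: PER 0.59, ORG 0.41.
soft = np.array([0.001, 0.593, 0.000, 0.406, 0.000])
print({label: float(p) for label, p in zip(LABELS, harden(soft))})
# -> {'O': 0.0, 'PER': 1.0, 'LOC': 0.0, 'ORG': 0.0, 'GPE': 0.0}
# The hard target commits fully to PER (incorrect for this example), while the
# soft target keeps a 40% belief in ORG for the foreign model to exploit.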
Besides the difference in projecting expectations rather than hard labels, our method and the "project-then-train" scheme also differ by optimizing different objectives: the CRF optimizes the maximum conditional likelihood of the observed label sequence, whereas GE minimizes the squared error between the model's expectation and the "hard" expectation based on the observed label sequence. In the case where the squared error loss is replaced with a KL-divergence loss, GE has the same effect as marginalizing out all positions with unknown projected labels, allowing more robust learning of uncertainties in the model. As we will show in the experimental results in Section 4.2, soft projection in combination with the GE objective significantly outperforms the project-then-train style CRF training scheme.

3.3 Source-side noise

An additional source of noise comes from errors generated by the source-side English CRF models. We know that the English CRF model gives an F1 score of 81.68% on the OntoNotes dataset for the English-Chinese experiment, and 90.45% on the CoNLL-03 dataset for the English-German experiment. We present a simple way of modeling English-side noise by picturing the following process: the labels assigned by the English CRF model (denoted as y) are a noised version of the true labels (denoted as y*). We can recover the probability of the true labels by marginalizing over the observed labels: P(y*|x) = Σ_y P(y*|y) P(y|x). P(y|x) is the posterior probability given by the CRF model, and we can approximate P(y*|y) by the column-normalized error confusion matrix shown in Table 1.

        O       PER    LOC   ORG   GPE
O       291339  391    141   1281  221
PER     1263    6721   5     56    73
LOC     409     23     546   123   133
ORG     2423    143    52    8387  196
GPE     566     239    69    668   6604

        O      PER   LOC   ORG   MISC
O       81209  24    38    155   103
PER     77     5725  41    69    10
LOC     49     40    3743  127   60
ORG     178    102   142   4075  91
MISC    175    41    30    114   1826

Table 1: Raw counts in the error confusion matrices of the English CRF models. The top table contains the counts on the OntoNotes test data, and the bottom table contains the CoNLL-03 test data counts. Rows are the true labels and columns are the observed labels. For example, the item at row 2, column 3 of the top table reads: we observed 5 cases where the true label should be PERSON, but the English CRF model output the label LOCATION.

This source-side noise model is likely to be overly simplistic. Generally speaking, we could build a much more sophisticated noising model for the source side, possibly conditioning on context, or capturing higher-order label sequences.
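This correction amounts to a single matrix-vector product per token before projection. The sketch below is our own illustration of the marginalization, not the paper's implementation, using the OntoNotes counts from Table 1.

import numpy as np

LABELS = ["O", "PER", "LOC", "ORG", "GPE"]

# Confusion counts from the top half of Table 1 (OntoNotes test data);
# rows are true labels y*, columns are labels y observed from the English CRF.
counts = np.array([
    [291339,  391, 141, 1281,  221],
    [  1263, 6721,   5,   56,   73],
    [   409,   23, 546,  123,  133],
    [  2423,  143,  52, 8387,  196],
    [   566,  239,  69,  668, 6604],
], dtype=float)

# Column-normalize to approximate P(y* = row | y = column).
p_true_given_obs = counts / counts.sum(axis=0, keepdims=True)

def denoise(posterior):
    """Estimate P(y*|x) = sum_y P(y*|y) P(y|x) from the English posterior P(y|x)."""
    return p_true_given_obs @ posterior

# Example: an English posterior that is fairly confident in PER.
posterior = np.array([0.001, 0.593, 0.000, 0.406, 0.000])
print({label: round(float(p), 3) for label, p in zip(LABELS, denoise(posterior))})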
4 Experiments

We conduct experiments on Chinese and German NER. We evaluate CLiPER in two learning settings: weakly supervised and semi-supervised. In the weakly supervised setting, we simulate the condition of having no labeled training data, and evaluate the model learned from bitext alone. We then vary the amount of labeled data available to the model, and examine the model's learning curve. In the semi-supervised setting, we assume our model has access to the full labeled data; our goal is to improve the performance of the supervised method by learning from additional bitext.

4.1 Dataset and setup

We used the latest version of the Stanford NER Toolkit7 as our base CRF model in all experiments. Features for the English, Chinese and German CRFs are documented extensively in (Che et al., 2013) and (Faruqui and Padó, 2010) and omitted here for brevity. It is worth noting that the current Stanford NER models include recent improvements from semi-supervised learning approaches that induce distributional similarity features from large word clusters. These models represent the current state-of-the-art in supervised methods, and serve as a very strong baseline.

For the Chinese NER experiments, we follow the same setup as Che et al. (2013) to evaluate on the latest OntoNotes (v4.0) corpus (Hovy et al., 2006).8 A total of 8,249 sentences from the parallel Chinese and English Penn Treebank portion9 are reserved for evaluation. Odd-numbered documents are used as the development set, and even-numbered documents are held out as the blind test set. The rest of OntoNotes annotated with NER tags is used to train the English and Chinese CRF base taggers. There are about 16k and 39k labeled sentences for Chinese and English training, respectively. The English CRF tagger trained on this training corpus gives an F1 score of 81.68% on the OntoNotes test set. Four entity types10 are used for both Chinese and English with an IO tagging scheme.11 The English-Chinese bitext comes from the Foreign Broadcast Information Service corpus (FBIS).12 We randomly sampled 80k parallel sentence pairs to use as bitext in our experiments. It is first sentence aligned using the Champollion Tool Kit,13 then word aligned with the BerkeleyAligner.14

7 http://www-nlp.stanford.edu/ner
8 LDC catalogue No.: LDC2011T03
9 File numbers: chtb 0001-0325, ectb 1001-1078
10 PERSON, LOCATION, ORGANIZATION and GPE.
11 We did not adopt the commonly seen BIO tagging scheme (Ramshaw and Marcus, 1999), because when projected across swapping word alignments, the "B-" and "I-" tag distinction may not be well-preserved and may introduce additional noise.
12 The FBIS corpus is a collection of radio news casts and contains translations of openly available news and information from media sources outside the United States. The LDC catalogue No. is LDC2003E14.
13 champollion.sourceforge.net
14 code.google.com/p/berkeleyaligner

For the German NER experiments, we evaluate using the standard CoNLL-03 NER corpus (Sang and Meulder, 2003). The labeled training set has 12k and 15k sentences, containing four entity types.15 An English CRF model is also trained on the CoNLL-03 English data with the same entity types. For bitext, we used a randomly sampled set of 40k parallel sentences from the de-en portion of the News Commentary dataset.16 The English CRF tagger trained on the CoNLL-03 English training corpus gives an F1 score of 90.4% on the CoNLL-03 test set.

15 PERSON, LOCATION, ORGANIZATION and MISCELLANEOUS.
16 http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz

We report typed entity precision (P), recall (R) and F1 score. Statistical significance tests are done using a paired bootstrap resampling method with 1000 iterations, averaged over 5 runs (a sketch of this test appears below). We compare against three recent approaches that were introduced in Section 2. They are: the semi-supervised learning method using factored bilingual models with Gibbs sampling (Wang et al., 2013a); bilingual NER using Integer Linear Programming (ILP) with bilingual constraints, by (Che et al., 2013); and the constraint-driven bilingual-reranking approach (Burkett et al., 2010). The code from (Che et al., 2013) and (Wang et al., 2013a) is publicly available.17 Code from (Burkett et al., 2010) was obtained through personal communications.

17 https://github.com/stanfordnlp/CoreNLP
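The paired bootstrap test mentioned above can be sketched as follows; this is our own illustration of the standard resampling procedure, not the authors' evaluation script, with entity scoring reduced to assumed per-sentence true-positive/false-positive/false-negative counts.

import random

def f1(counts):
    """Entity F1 from a list of per-sentence (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def paired_bootstrap(counts_a, counts_b, iterations=1000, seed=0):
    """Fraction of bootstrap resamples in which system A fails to beat system B;
    a small value indicates that A's advantage is statistically significant."""
    assert len(counts_a) == len(counts_b)
    rng = random.Random(seed)
    n = len(counts_a)
    losses = 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentences with replacement
        if f1([counts_a[i] for i in idx]) <= f1([counts_b[i] for i in idx]):
            losses += 1
    return losses / iterations

# Hypothetical usage, where each list holds per-sentence (tp, fp, fn) tuples:
# p = paired_bootstrap(counts_cliper, counts_crf, iterations=1000)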
Since the objective function in Eqn. 2 is non-convex, we adopted the early stopping training scheme from (Turian et al., 2010): after each iteration of L-BFGS training, the model is evaluated against the development set; the training procedure is terminated if no improvement has been made in 20 iterations.

4.2 Weakly supervised results

Figures 2a and 2b show the results of the weakly supervised learning experiments. Quite remarkably, on the Chinese test set, our proposed method (CLiPER) achieves an F1 score of 64.4% with 80k bitext when no labeled training data is used. In contrast, the supervised CRF baseline would require as much as 12k labeled sentences to attain the same accuracy. Results on the German test set are less striking. With no labeled data and 40k of bitext, CLiPER performs at an F1 of 60.0%, the equivalent of using 1.5k labeled examples in the supervised setting. When combined with 1k labeled examples, the performance of CLiPER reaches 69%, a gain of over 5% absolute over the supervised CRF. We also notice that the supervised CRF model learns much faster in German than in Chinese. This result is not too surprising, since it is well recognized that Chinese NER is more challenging than German or English NER. The best supervised results for Chinese are 10-20% (F1 score) behind the best German and English supervised results. Chinese NER relies more on lexicalized features, and therefore needs more labeled data to achieve good coverage. The results suggest that CLiPER is very effective at transferring lexical knowledge from English to Chinese.

Figures 2c and 2d compare soft GE projection with hard GE projection and the "project-then-train" style CRF training scheme (cf. Section 3.2). We observe that both soft and hard GE projection significantly outperform the "project-then-train" style training scheme. The difference is especially pronounced on the Chinese results when fewer labeled examples are available. Soft projection gives better accuracy than hard projection when no labeled data is available, and also has a faster learning rate.

Incorporating source-side noise using the method described in Section 3.3 gives a small improvement on Chinese when no labeled data is used, increasing the F1 score from 64.40% to 65.50%. This improvement is statistically significant at the 92% confidence level. However, on the German data, we observe a tiny, statistically insignificant decrease in F1 score, dropping from 59.88% to 59.66%. A likely explanation of the difference is that the English CRF model in the English-Chinese experiment, which is trained on OntoNotes data, has a much higher error rate (18.32%) than the English CRF model in the English-German experiment trained on CoNLL-03 (9.55%). Therefore, modeling noise in the English-Chinese case is likely to have a greater effect than in the English-German case.

4.3 Semi-supervised results

In the semi-supervised experiments, we let the CRF model use the full set of labeled examples in addition to the unlabeled bitext. Results on the test set are shown in Table 2. All semi-supervised baselines are tested with the same amount of unlabeled bitext as CLiPER in each language. The "project-then-train" semi-supervised training scheme severely hurts performance on Chinese, but gives a small improvement on German. Moreover, on Chinese it learns to achieve high precision but at a significant loss in recall. On German, its behavior is the opposite.
Such drastic and erratic imbalance suggest that this method is not robust or reliable. The other three semi-supervised baselines (row 3-5) all show im- provements over the CRF baseline, consistent with their reported results. CLIPERs gives the best re- sults on both Chinese and German, yielding statis- tically significant improvements over all baselines except for CWD13 on German. The hard projection version of CLiPER also gives sizable gain over CRF. However, in comparison, CLIPERs is superior. The improvements of CLIPERs over CRF on Chinese test set is over 2.8% in absolute F1. The improvement over CRF on German is almost a per- cent. To our knowledge, these are the best reported numbers on the OntoNotes Chinese and CoNLL-03 German datasets. 4.4 Efficiency Another advantage of our proposed approach is ef- ficiency. Because we eliminated the previous multi- stage “uptraining” paradigm, but instead integrating the semi-supervised and supervised objective into one joint objective, we are able to attain signifi- cant speed improvements over all methods except CRFptt. Table 3 shows the required training time. 62 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 10 20 30 40 50 60 70 80 # of labeled training sentences [k] F 1 sc or e [% ] supervised CRF CLiPPER soft (a) Chinese Test 0 1 2 3 4 5 6 7 8 9 10 11 12 0 10 20 30 40 50 60 70 80 # of labeled training sentences [k] F 1 sc or e [% ] supervised CRF CLiPPER soft (b) German Test 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 # of labeled training sentences [k] F 1 sc or e [% ] CRF projection CLiPPER hard CLiPPER soft (c) Soft vs. Hard on Chinese Test 0 1 2 3 4 5 6 7 8 9 10 11 12 54 56 58 60 62 64 66 68 70 72 74 76 78 80 # of labeled training sentences [k] F 1 sc or e [% ] CRF projection CLiPPER hard CLiPPER soft (d) Soft vs. Hard on German Test [高岗] 纪念碑 在 [横山] 落成 A monument commemorating [Vice President Gao GangPER ] was completed in [HengshanLOC ] (e) Word proceeding “monument” is PERSON [碛口] [毛主席] 东渡 [黄河] 纪念碑 简介 Introduction of [QikouLOC ] [Chairman MaoPER ] [Yellow RiverLOC ] crossing monument (f) Word proceeding “monument” is LOCATION Figure 2: Top four figures show performance curves of CLiPER with varying amounts of available labeled training data in a weakly supervised setting. Vertical axes show the F1 score on the test set. Performance curves of supervised CRF and “project-then-train” CRF are plotted for comparison. Bottom two figures are examples of aligned sentence pairs in Chinese and English. 63 Chinese German P R F1 P R F1 CRF 79.09 63.59 70.50 86.69 71.30 78.25 CRFptt 84.01 45.29 58.85 81.50 75.56 78.41 BPBK10 79.25 65.67 71.83 84.00 72.17 77.64 CWD13 81.31 65.50 72.55 85.99 72.98 78.95 WCD13a 80.31 65.78 72.33 85.98 72.37 78.59 WCD13b 78.55 66.54 72.05 85.19 72.98 78.62 CLiPERh 83.67 64.80 73.04§‡ 86.52 72.02 78.61∗ CLiPERs 82.57 65.99 73.35 §†? �∗ 87.11 72.56 79.17 ‡? ∗§ Table 2: Test set Chinese, German NER results. Best number of each column is highlighted in bold. CRF is the supervised baseline. CRFptt is the “project-then-train” semi-supervised scheme for CRF. BPBK10 is (Burkett et al., 2010), WCD13 is (Wang et al., 2013a), CWD13A is (Che et al., 2013), and WCD13B is (Wang et al., 2013b) . CLIPERs and CLIPERh are the soft and hard projections. § indicates F1 scores that are statistically significantly better than CRF baseline at 99.5% confidence level; ? 
5 Discussions

Figures 2e and 2f give two examples of cross-lingual projection methods in action. Both examples have a named entity that immediately precedes the word "纪念碑" (monument) in the Chinese sentence. In Figure 2e, the word "高岗" has the literal meaning of a hillock located at a high position, which also happens to be the name of a former vice president of China. Without having previously observed this word as a person name in the labeled training data, the CRF model does not have enough evidence to believe that this is a PERSON, instead of a LOCATION. But the aligned words in English ("Gao Gang") are clearly part of a person name as they were preceded by a title ("Vice President"). The English model has a high expectation that the aligned Chinese word of "Gao Gang" is also a PERSON. Therefore, projecting the English expectations to Chinese provides a strong clue to help disambiguate this word. Figure 2f gives another example: the word "黄河" (Huang He, the Yellow River of China) can be confused with a person name since "黄" (Huang or Hwang) is also a common Chinese last name.18 Again, knowing the translation in English, which has the indicative word "River" in it, helps disambiguation.

18 In fact, a people search of the name 黄河 on the most popular Chinese social network (renren.com) returns over 13,000 matches.

The CRFptt and CLiPERh methods successfully labeled these two examples correctly, but failed to produce the correct label for the example in Figure 1. On the other hand, a model trained with the CLiPERs method does correctly label both entities in Figure 1, demonstrating the merits of the soft projection method.

6 Conclusion

We introduced a domain- and language-independent semi-supervised method for training discriminative models by projecting expectations across bitext. Experiments on Chinese and German NER show that our method, learned over bitext alone, can rival the performance of supervised models trained with thousands of labeled examples. Furthermore, applying our method in a setting where all labeled examples are available also shows improvements over state-of-the-art supervised methods. Our experiments also showed that soft expectation projection is preferable to hard projection. This technique can be generalized to all sequence labeling tasks, and can be extended to include more complex constraints. For future work, we plan to apply this method to more language pairs and also explore data selection strategies and modeling of alignment uncertainties.

Acknowledgments

The authors would like to thank Jennifer Gillenwater for a discussion that inspired this work, Behrang Mohit and Nathan Schneider for their help with the Arabic NER data, and David Burkett for providing the source code of their work for comparison. We would also like to thank editor Lillian Lee and the three anonymous reviewers for their valuable comments and suggestions. We gratefully acknowledge the support of the U.S. Defense Advanced Research Projects Agency (DARPA) Broad Operational Language Translation (BOLT) program through IBM.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA or the US government.

References

Hiyan Alshawi, Srinivas Bangalore, and Shona Douglas. 2000. Head-transducer models for speech translation and their automatic acquisition from bilingual data. Machine Translation, 15.
Rie Kubota Ando and Tong Zhang. 2005. A high-performance semi-supervised learning method for text chunking. In Proceedings of ACL.
Kedar Bellare, Gregory Druck, and Andrew McCallum. 2009. Alternating projections for learning with expectation constraints. In Proceedings of UAI.
Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of COLT.
David Burkett and Dan Klein. 2008. Two languages are better than one (for syntactic parsing). In Proceedings of EMNLP.
David Burkett, Slav Petrov, John Blitzer, and Dan Klein. 2010. Learning better monolingual models with unannotated bilingual text. In Proceedings of CoNLL.
Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Coupled semi-supervised learning for information extraction. In Proceedings of WSDM.
Ming-Wei Chang, Lev Ratinov, and Dan Roth. 2007. Guiding semi-supervision with constraint-driven learning. In Proceedings of ACL.
Wanxiang Che, Mengqiu Wang, and Christopher D. Manning. 2013. Named entity recognition with bilingual constraints. In Proceedings of NAACL.
Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Proceedings of EMNLP.
Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of ACL.
Gregory Druck and Andrew McCallum. 2010. High-performance semi-supervised learning using discriminatively constrained generative models. In Proceedings of ICML.
Gregory Druck, Gideon Mann, and Andrew McCallum. 2007. Leveraging existing resources using generalized expectation criteria. In Proceedings of NIPS Workshop on Learning Problem Design.
Gregory Druck, Burr Settles, and Andrew McCallum. 2009. Active learning by labeling features. In Proceedings of EMNLP.
Gregory Druck. 2011. Generalized Expectation Criteria for Lightly Supervised Learning. Ph.D. thesis, University of Massachusetts Amherst.
Manaal Faruqui and Sebastian Padó. 2010. Training and evaluating a German named entity recognizer with semantic generalization. In Proceedings of KONVENS.
Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Ng. 2006. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In Proceedings of EMNLP.
Victoria Fossum and Steven Abney. 2005. Automatically inducing a part-of-speech tagger by projecting from multiple source languages across aligned corpora. In Proceedings of IJCNLP.
Kuzman Ganchev and Dipanjan Das. 2013. Cross-lingual discriminative learning of sequence models with posterior regularization. In Proceedings of EMNLP.
Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2009. Dependency grammar induction via bitext projection constraints. In Proceedings of ACL.
Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. JMLR, 11:2001–2049.
Andrew B. Goldberg. 2010. New Directions in Semi-supervised Learning. Ph.D. thesis, University of Wisconsin-Madison.
Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of NAACL-HLT.
Dan Klein. 2005. The Unsupervised Learning of Natural Language Structure. Ph.D. thesis, Stanford University.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML.
Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of EMNLP.
Shen Li, João Graça, and Ben Taskar. 2012. Wiki-ly supervised part-of-speech tagging. In Proceedings of EMNLP-CoNLL.
Percy Liang. 2005. Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology.
Gideon Mann and Andrew McCallum. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. JMLR, 11:955–984.
David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of NAACL-HLT.
Tahira Naseem, Benjamin Snyder, Jacob Eisenstein, and Regina Barzilay. 2009. Multilingual part-of-speech tagging: Two unsupervised approaches. JAIR, 36.
Slav Petrov, Pi-Chuan Chang, Michael Ringgaard, and Hiyan Alshawi. 2010. Uptraining for accurate deterministic question parsing. In Proceedings of EMNLP.
Lance A. Ramshaw and Mitchell P. Marcus. 1999. Text chunking using transformation-based learning. Natural Language Processing Using Very Large Corpora, 11:157–176.
Rajhans Samdani, Ming-Wei Chang, and Dan Roth. 2012. Unified expectation maximization. In Proceedings of NAACL.
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of CoNLL.
Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. 2005. A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of the ICML Workshop on Learning with Multiple Views.
Noah A. Smith. 2006. Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text. Ph.D. thesis, Johns Hopkins University.
Benjamin Snyder, Tahira Naseem, and Regina Barzilay. 2009. Unsupervised multilingual grammar induction. In Proceedings of ACL.
Jun Suzuki and Hideki Isozaki. 2008. Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In Proceedings of ACL.
Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. In Proceedings of ACL.
Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL.
Mengqiu Wang, Wanxiang Che, and Christopher D. Manning. 2013a. Effective bilingual constraints for semi-supervised learning of named entity recognizers. In Proceedings of AAAI.
Mengqiu Wang, Wanxiang Che, and Christopher D. Manning. 2013b. Joint word alignment and bilingual named entity recognition using dual decomposition. In Proceedings of ACL.
Chenhai Xi and Rebecca Hwa. 2005. A backoff model for bootstrapping resources for non-English languages. In Proceedings of HLT-EMNLP.
David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of NAACL.
David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of ACL.