Cross-lingual Projected Expectation Regularization for Weakly Supervised Learning

Mengqiu Wang and Christopher D. Manning
Computer Science Department, Stanford University
Stanford, CA 94305 USA
{mengqiu,manning}@cs.stanford.edu

Abstract

We consider a multilingual weakly supervised learning scenario where knowledge from annotated corpora in a resource-rich language is transferred via bitext to guide the learning in other languages. Past approaches project labels across bitext and use them as features or gold labels for training. We propose a new method that projects model expectations rather than labels, which facilitates the transfer of model uncertainty across language boundaries. We encode expectations as constraints and train a discriminative CRF model using Generalized Expectation Criteria (Mann and McCallum, 2010). Evaluated on standard Chinese-English and German-English NER datasets, our method demonstrates F1 scores of 64% and 60% when no labeled data is used. Attaining the same accuracy with supervised CRFs requires 12k and 1.5k labeled sentences. Furthermore, when combined with labeled examples, our method yields significant improvements over state-of-the-art supervised methods, achieving the best reported numbers to date on the Chinese OntoNotes and German CoNLL-03 datasets.

1 Introduction

Supervised statistical learning methods have enjoyed great popularity in Natural Language Processing (NLP) over the past decade. The success of supervised methods depends heavily upon the availability of large amounts of annotated training data. Manual curation of annotated corpora is a costly and time-consuming process. To date, most annotated resources reside within the English language, which hinders the adoption of supervised learning methods in many multilingual environments.

To minimize the need for annotation, significant progress has been made in developing unsupervised and semi-supervised approaches to NLP (Collins and Singer 1999; Klein 2005; Liang 2005; Smith 2006; Goldberg 2010; inter alia). More recent paradigms for semi-supervised learning allow modelers to directly encode knowledge about the task and the domain as constraints to guide learning (Chang et al., 2007; Mann and McCallum, 2010; Ganchev et al., 2010). However, in a multilingual setting, coming up with effective constraints requires extensive knowledge of the foreign1 language.

1 For experimental purposes, we designate English as the resource-rich language, and other languages of interest as "foreign". In our experiments, we simulate the resource-poor scenario using Chinese and German, even though in reality these two languages are quite rich in resources.

Bilingual parallel text (bitext) lends itself as a medium to transfer knowledge from a resource-rich language to foreign languages. Yarowsky and Ngai (2001) project labels produced by an English tagger to the foreign side of bitext, then use the projected labels to learn an HMM model. More recent work applied the projection-based approach to more language pairs, and further improved performance through the use of type-level constraints from tag dictionaries and feature-rich generative or discriminative models (Das and Petrov, 2011; Täckström et al., 2013).

In our work, we propose a new projection-based method that differs in two important ways. First, we never explicitly project the labels. Instead, we project expectations over the labels. This projection
acts as a soft constraint over the labels, which allows us to transfer more information and uncertainty across language boundaries. Secondly, we encode the expectations as constraints and train a model by minimizing divergence between model expectations and projected expectations in a Generalized Expectation (GE) Criteria (Mann and McCallum, 2010) framework.

We evaluate our approach on Named Entity Recognition (NER) tasks for English-Chinese and English-German language pairs on standard public datasets. We report results in two settings: a weakly supervised setting where no labeled data or a small amount of labeled data is available, and a semi-supervised setting where labeled data is available, but we can gain predictive power by learning from unlabeled bitext.

2 Related Work

Most semi-supervised learning approaches embody the principle of learning from constraints. There are two broad categories of constraints: multi-view constraints, and external knowledge constraints.

Examples of methods that explore multi-view constraints include self-training (Yarowsky, 1995; McClosky et al., 2006),2 co-training (Blum and Mitchell, 1998; Sindhwani et al., 2005), multi-view learning (Ando and Zhang, 2005; Carlson et al., 2010), and discriminative and generative model combination (Suzuki and Isozaki, 2008; Druck and McCallum, 2010).

2 A multi-view interpretation of self-training is that the self-tagged additional data offers new views to learners trained on existing labeled data.

An early example of using knowledge as constraints in weakly supervised learning is the work by Collins and Singer (1999). They showed that the addition of a small set of "seed" rules greatly improves a co-training style unsupervised tagger. Chang et al. (2007) proposed a constraint-driven learning (CODL) framework where constraints are used to guide the selection of the best self-labeled examples to be included as additional training data in an iterative EM-style procedure. The kinds of constraints used in applications such as NER are ones like "the words CA, Australia, NY are LOCATION" (Chang et al., 2007). Notice the similarity of this particular constraint to the kinds of features one would expect to see in a discriminative MaxEnt model. The difference is that instead of learning the validity (or weight) of this feature from labeled examples — since we do not have them — we can constrain the model using our knowledge of the domain. Druck et al. (2009) also demonstrated that in an active learning setting where the annotation budget is limited, it is more efficient to label features than examples. Other sources of knowledge include lexicons and gazetteers (Druck et al., 2007; Chang et al., 2007).

While it is straightforward to see how resources such as a list of city names can give a lot of mileage in recognizing locations, we are also exposed to the danger of over-committing to hard constraints. For example, it becomes problematic with city names that are ambiguous, such as Augusta, Georgia.3 To soften these constraints, Mann and McCallum (2010) proposed the Generalized Expectation (GE) Criteria framework, which encodes constraints as a regularization term over some score function that measures the divergence between the model's expectation and the target expectation.
The connection between GE and CODL is analogous to the relationship between hard (Viterbi) EM and soft EM, as illustrated by Samdani et al. (2012).

3 This is a city in the state of Georgia in the USA, famous for its golf courses. It is ambiguous since both Augusta and Georgia can also be used as person names.

Another closely related work is the Posterior Regularization (PR) framework by Ganchev et al. (2010). In fact, as Bellare et al. (2009) have shown, in a discriminative model these two methods optimize exactly the same objective.4 The two differ in optimization details: PR uses an EM algorithm to approximate the gradients, which avoids the expensive computation of a covariance matrix between features and constraints, whereas GE directly calculates the gradient. However, later results (Druck, 2011) have shown that using the Expectation Semiring techniques of Li and Eisner (2009), one can compute the exact gradients of GE in a Conditional Random Field (CRF) (Lafferty et al., 2001) at costs no greater than computing the gradients of an ordinary CRF. And empirically, GE tends to perform more accurately than PR (Bellare et al., 2009; Druck, 2011).

4 The different terminology employed by GE and PR may be confusing to discerning readers, but the "expectation" in the context of GE means the same thing as "marginal posterior" in PR.

Obtaining appropriate knowledge resources for constructing constraints remains a bottleneck in applying GE and PR to new languages. However, a number of past works recognize parallel bitext as a rich source of linguistic constraints, naturally captured in the translations. As a result, bitext has been effectively utilized for unsupervised multilingual grammar induction (Alshawi et al., 2000; Snyder et al., 2009), parsing (Burkett and Klein, 2008), and sequence labeling (Naseem et al., 2009).

A number of recent works have also explored bilingual constraints in the context of simultaneous bilingual tagging, and showed that enforcing agreement between language pairs gives superior results to monolingual tagging (Burkett et al., 2010; Che et al., 2013; Wang et al., 2013a). Burkett et al. (2010) also demonstrated an uptraining (Petrov et al., 2010) setting where tag-induced bitext can be used as additional monolingual training data to improve monolingual taggers. A major drawback of this approach is that it requires readily trained tagging models in each language, which makes a weakly supervised setting infeasible. Another intricacy of this approach is that it only works when the two models have comparable strength, since mutual agreements are enforced between them.

Projection-based methods can be very effective in weakly supervised scenarios, as demonstrated by Yarowsky and Ngai (2001), and Xi and Hwa (2005). One problem with projected labels is that they are often too noisy to be directly used as training signals. To mitigate this problem, Das and Petrov (2011) designed a label propagation method to automatically induce a tag lexicon for the foreign language to smooth the projected labels. Fossum and Abney (2005) filter out projection noise by combining projections from multiple source languages. However, this approach is not always viable since it relies on having parallel bitext from multiple source languages. Li et al. (2012) proposed the use of the crowd-sourced Wiktionary as an additional resource for inducing tag lexicons. More recently, Täckström et al.
(2013) combined token-level and type-level constraints to constrain legitimate label sequences and recalibrate the probability distribution in a CRF. The tag dictionaries used for POS tagging are analogous to the gazetteers and name lexicons used for NER by Chang et al. (2007).

Our work is also closely related to Ganchev et al. (2009). They used a two-step projection method similar to Das and Petrov (2011) for dependency parsing. Instead of using the projected linguistic structures as ground truth (Yarowsky and Ngai, 2001), or as features in a generative model (Das and Petrov, 2011), they used them as constraints in a PR framework. Our work differs by projecting expectations rather than Viterbi one-best labels. We also choose the GE framework over PR. Experiments in Bellare et al. (2009) and Druck (2011) suggest that in a discriminative model (like ours), GE is more accurate than PR. More recently, Ganchev and Das (2013) further extended this line of work to directly train discriminative sequence models using cross-lingual projection with PR. The types of constraints applied in this new work are similar to the ones in the monolingual PR setting proposed by Ganchev et al. (2010), where the total counts of labels of a particular kind are expected to match some fraction of the projected total counts. Our work differs in that we enforce expectation constraints at the token level, which gives tighter guidance to learning the model.

3 Approach

Given bitext between English and a foreign language, our goal is to learn a CRF model in the foreign language from little or no labeled data. Our method performs Cross-Lingual Projected Expectation Regularization (CLiPER).

For every aligned sentence pair in the bitext, we first compute the posterior marginal at each word position on the English side using a pre-trained English CRF tagger; then for each aligned English word, we project its posterior marginal as expectations to the aligned word position on the foreign side. Figure 1 shows a snippet of a sentence from a real corpus.

[Figure 1 appears here: an aligned English-Chinese sentence pair ("... a reception in Luobu Linka ... met with representatives of Zhongguo Ribao" / "... 在 罗布林卡 举行 的 招待会 ... 会见 了 中国 日报 代表"), with the English CRF posterior over labels shown above each English word.] Figure 1: Diagram illustrating the projection of model expectation from English to Chinese. The posterior probabilities assigned by the English CRF model are shown above each English word; automatically induced word alignments are shown in red; the correct projected labels for Chinese words are shown in green, and incorrect labels are shown in red.

Notice that if we were to directly project the Viterbi best assignment from English to Chinese, all three Chinese words that are named entities would have gotten the wrong tags. But projecting the English CRF model expectations preserves some uncertainties, informing the Chinese model that there is a 40%
chance that "中国日报" (China Daily) is an organization in this context.

We would like to learn a CRF model in the foreign language that has similar expectations as the projected expectations from English. To this end, we adopt the Generalized Expectation (GE) Criteria framework introduced by Mann and McCallum (2010). In the remainder of this section, we follow the notation used in (Druck, 2011) to explain our approach.

3.1 CLiPER

The general idea of GE is that we can express our preferences over models through constraint functions. A desired model should satisfy the imposed constraints by matching the expectations on these constraint functions with some target expectations (attained from external knowledge like lexicons or, in our case, transferred knowledge from English).

We define a constraint function φ_{i,l_j} for each word position i and output label assignment l_j. φ_{i,l_j} = 0 is a constraint that position i cannot take label l_j. The set {l_1, ..., l_m} denotes all possible label assignments for each y_i, and m is the number of label values. A_i is the set of English words aligned to Chinese word i. The φ_{i,l_j} are defined for all positions i such that A_i ≠ ∅. In other words, the constraint function applies only to Chinese word positions that have at least one aligned English word. Each φ_{i,l_j}(y) can be treated as a Bernoulli random variable, and we concatenate the set of all φ_{i,l_j} into a random vector φ(y), where φ_k = φ_{i,l_j} if k = i*m + j. We drop the (y) in φ for simplicity.

The target expectation over φ_{i,l_j}, denoted as φ̃_{i,l_j}, is the expectation of assigning label l_j to the English word(s) in A_i under the English conditional probability model. When multiple English words are aligned to the same foreign word, we average the expectations.

The expectation over φ under a conditional probability model P(y|x; θ) is denoted as E_{P(y|x;θ)}[φ], and simplified as E_θ[φ] whenever it is unambiguous. The conditional probability model P(y|x; θ) in our case is defined as a standard linear-chain CRF:5

P(y|x;\theta) = \frac{1}{Z(x;\theta)} \exp\left( \sum_{i}^{n} \theta \cdot f(x, y_i, y_{i-1}) \right)

where f is a set of feature functions, θ are the matching parameters to learn, and n = |x|.

5 We simplify notation by dropping the L2 regularizer in the CRF definition, but apply it in our experiments.

The objective function to maximize in a standard CRF is the log probability over a collection of labeled documents:

L_{CRF}(\theta) = \sum_{a=1}^{a'} \log P(y^*_a \mid x_a; \theta) \qquad (1)

where a' is the number of labeled sentences and y* is an observed label sequence.

The objective function to maximize in GE is defined as the sum over all unlabeled examples on the foreign side of the bitext, denoted as x_b, over some cost function S between the model expectation over φ (E_θ[φ]) and the target expectation (φ̃). We choose S to be the negative L2^2 squared error sum,6 defined as:

L_{GE}(\theta) = \sum_{b=1}^{n'} S\left( E_{P(y_b \mid x_b;\theta)}[\phi(y_b)],\; \tilde{\phi}_b \right) = \sum_{b=1}^{n'} -\left\| \tilde{\phi}_b - E_\theta[\phi(y_b)] \right\|_2^2 \qquad (2)

where n' is the total number of unlabeled bitext sentence pairs.

6 In general, other loss functions such as KL-divergence can also be used for S. We found L2^2 to work well in practice.

When both labeled and bitext training data are available, the joint objective is the sum of Eqn. 1 and 2. Each is computed over the labeled training data and the foreign half of the bitext, respectively.
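To make the construction of the projected targets concrete, the following is a minimal sketch, assuming the English CRF exposes per-token posterior marginals as an array and that word alignments are given as index pairs; the function and variable names are our own illustration rather than the released implementation.

import numpy as np

def project_expectations(en_posteriors, alignments, len_fr):
    """Map English posterior marginals to projected target expectations.

    en_posteriors: (len_en, m) array of P(label | position) from the English CRF.
    alignments: iterable of (en_index, fr_index) word-alignment pairs.
    len_fr: number of words in the foreign sentence.
    Returns {foreign position i: length-m target distribution}, i.e. the
    tilde-phi constraints; positions with empty A_i receive no constraint.
    """
    m = en_posteriors.shape[1]
    sums = np.zeros((len_fr, m))
    counts = np.zeros(len_fr)
    for en_i, fr_i in alignments:
        sums[fr_i] += en_posteriors[en_i]   # accumulate aligned English expectations
        counts[fr_i] += 1
    return {i: sums[i] / counts[i]          # average when several English words align
            for i in range(len_fr) if counts[i] > 0}

def ge_penalty(fr_marginals, targets):
    """Negative squared L2 error of Eqn. 2 for one foreign sentence, given the
    foreign model's current marginals (len_fr, m) and the projected targets."""
    loss = 0.0
    for i, tgt in targets.items():
        diff = tgt - fr_marginals[i]
        loss -= float(diff @ diff)
    return loss

For the sentence pair in Figure 1, the target at the position of 中国 would place roughly 0.59 on PER and 0.41 on ORG, which is exactly the soft signal discussed above.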
We can optimize this joint objective by computing the gradients and using a gradient-based optimization method such as L-BFGS. The gradient of L_CRF decomposes into the gradients over each labeled training example (x, y*). Computing the gradient of L_GE decomposes into the gradients of S(E_{P(y|x_b;θ)}[φ]) for each unlabeled foreign sentence x and the constraints φ over this example. The gradients can be calculated as:

\frac{\partial}{\partial\theta} S(E_\theta[\phi]) = -\frac{\partial}{\partial\theta} \left( \tilde{\phi} - E_\theta[\phi] \right)^{T} \left( \tilde{\phi} - E_\theta[\phi] \right) = 2 \left( \tilde{\phi} - E_\theta[\phi] \right)^{T} \left( \frac{\partial}{\partial\theta} E_\theta[\phi] \right)

We denote the penalty vector 2(φ̃ − E_θ[φ]) by u. The term ∂E_θ[φ]/∂θ is a matrix where each column contains the gradients for a particular model feature θ with respect to all constraint functions φ. It can be computed as:

\begin{aligned}
\frac{\partial}{\partial\theta} E_\theta[\phi]
 &= \sum_y \phi(y) \frac{\partial}{\partial\theta} P(y \mid x;\theta) \\
 &= \sum_y \phi(y) \frac{\partial}{\partial\theta} \left( \frac{1}{Z(x;\theta)} \exp(\theta^T f(x,y)) \right) \\
 &= \sum_y \phi(y) \left( \frac{1}{Z(x;\theta)} \left( \frac{\partial}{\partial\theta} \exp(\theta^T f(x,y)) \right) + \exp(\theta^T f(x,y)) \left( \frac{\partial}{\partial\theta} \frac{1}{Z(x;\theta)} \right) \right) \\
 &= \sum_y \phi(y) \left( P(y \mid x;\theta) f(x,y)^T - P(y \mid x;\theta) \sum_{y'} P(y' \mid x;\theta) f(x,y')^T \right) \\
 &= \sum_y P(y \mid x;\theta)\, \phi(y) f(x,y)^T - \left( \sum_y P(y \mid x;\theta) \phi(y) \right) \left( \sum_y P(y \mid x;\theta) f(x,y)^T \right) \\
 &= \mathrm{COV}_{P(y \mid x;\theta)}\left( \phi(y), f(x,y) \right) \qquad (3) \\
 &= E_\theta[\phi f^T] - E_\theta[\phi]\, E_\theta[f^T] \qquad (4)
\end{aligned}

Eqn. 3 gives the intuition of how optimization works in GE. In each iteration of L-BFGS, the model parameters are updated according to their covariance with the constraint features, scaled by the difference between the current expectation and the target expectation. The term E_θ[φ f^T] in Eqn. 4 can be computed using a dynamic programming (DP) algorithm, but solving it directly requires us to store a matrix of the same dimension as f^T in each step of the DP. We can reduce the complexity by using the same trick as in (Li and Eisner, 2009) for computing the Expectation Semiring. The resulting algorithm has complexity O(nm^2), which is the same as the standard forward-backward inference algorithm for a CRF. (Druck, 2011, 93) gives full details of this derivation.

3.2 Hard vs. soft Projection

Projecting expectations instead of one-best label assignments from English to the foreign language can be thought of as a soft version of the method described in (Das and Petrov, 2011) and (Ganchev et al., 2009). Soft projection has its advantage: when the English model is not certain about its predictions, we do not have to commit to the current best prediction. The foreign model has more freedom to form its own belief, since any marginal distribution it produces would deviate from a flat distribution by just about the same amount. In general, preserving uncertainties until later is a strategy that has benefited many NLP tasks (Finkel et al., 2006). Hard projection can also be treated as a special case in our framework. We can simply recalibrate the posterior marginal of English by assigning probability mass 1 to the most likely outcome and zeroing everything else out, effectively taking the argmax of the marginal at each word position. We refer to this version of the expectation as the "hard" expectation. In the hard projection setting, GE training resembles a "project-then-train" style semi-supervised CRF training scheme (Yarowsky and Ngai, 2001; Täckström et al., 2013). In such a training scheme, we project the one-best predictions of the English CRF to the foreign side through word alignments, then include the newly "tagged" foreign data as additional training data for a standard CRF in the foreign language. Rather than projecting labels on a per-word basis, Yarowsky and Ngai (2001) also explored an alternative method for the noun-phrase (NP) bracketing task that amounts to projecting the spans of NPs, based on the observation that individual NPs tend to retain their sequential spans across translations. We experimented with the same method for NER, but found that this method of projecting the NE spans does not help in reducing noise and actually lowers model performance.
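The "hard" expectation described above is simply the argmax of the projected marginal at each position. The short sketch below (our own illustration, with the label order and variable names assumed) makes the contrast with soft projection explicit, using the approximate posteriors from Figure 1.

import numpy as np

LABELS = ["O", "PER", "LOC", "ORG", "GPE"]

def harden(soft_target):
    """Assign probability mass 1 to the most likely label and zero out the rest,
    i.e. take the argmax of the projected marginal at this word position."""
    hard = np.zeros_like(soft_target)
    hard[np.argmax(soft_target)] = 1.0
    return hard

# Approximate posteriors projected onto "中国" in Figure 1: PER 0.59, ORG 0.41.
soft = np.array([0.001, 0.593, 0.000, 0.406, 0.000])
print({label: float(p) for label, p in zip(LABELS, harden(soft))})
# -> {'O': 0.0, 'PER': 1.0, 'LOC': 0.0, 'ORG': 0.0, 'GPE': 0.0}
# The hard target commits fully to PER (incorrect for this example), while the
# soft target keeps a 40% belief in ORG for the foreign model to exploit.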
Besides the difference in projecting expectations rather than hard labels, our method and the "project-then-train" scheme also differ by optimizing different objectives: the CRF optimizes the maximum conditional likelihood of the observed label sequence, whereas GE minimizes the squared error between the model's expectation and the "hard" expectation based on the observed label sequence. In the case where the squared error loss is replaced with a KL-divergence loss, GE has the same effect as marginalizing out all positions with unknown projected labels, allowing more robust learning of uncertainties in the model. As we will show in the experimental results in Section 4.2, soft projection in combination with the GE objective significantly outperforms the project-then-train style CRF training scheme.

3.3 Source-side noise

An additional source of noise comes from errors generated by the source-side English CRF models. We know that the English CRF model gives an F1 score of 81.68% on the OntoNotes dataset for the English-Chinese experiment, and 90.45% on the CoNLL-03 dataset for the English-German experiment. We present a simple way of modeling English-side noise by picturing the following process: the labels assigned by the English CRF model (denoted as y) are a noised version of the true labels (denoted as y*). We can recover the probability of the true labels by marginalizing over the observed labels: P(y*|x) = Σ_y P(y*|y) P(y|x). P(y|x) is the posterior probability given by the CRF model, and we can approximate P(y*|y) by the column-normalized error confusion matrix shown in Table 1.

        O       PER    LOC   ORG   GPE
O       291339  391    141   1281  221
PER     1263    6721   5     56    73
LOC     409     23     546   123   133
ORG     2423    143    52    8387  196
GPE     566     239    69    668   6604

        O      PER   LOC   ORG   MISC
O       81209  24    38    155   103
PER     77     5725  41    69    10
LOC     49     40    3743  127   60
ORG     178    102   142   4075  91
MISC    175    41    30    114   1826

Table 1: Raw counts in the error confusion matrices of the English CRF models. The top table contains the counts on the OntoNotes test data, and the bottom table contains the CoNLL-03 test data counts. Rows are the true labels and columns are the observed labels. For example, the item at row 2, column 3 of the top table reads: we observed 5 cases where the true label should be PERSON, but the English CRF model output the label LOCATION.

This source-side noise model is likely to be overly simplistic. Generally speaking, we could build a much more sophisticated noising model for the source side, possibly conditioning on context, or capturing higher-order label sequences.
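This correction amounts to a single matrix-vector product per token before projection. The sketch below is our own illustration of the marginalization, not the paper's implementation, using the OntoNotes counts from Table 1.

import numpy as np

LABELS = ["O", "PER", "LOC", "ORG", "GPE"]

# Confusion counts from the top half of Table 1 (OntoNotes test data);
# rows are true labels y*, columns are labels y observed from the English CRF.
counts = np.array([
    [291339,  391, 141, 1281,  221],
    [  1263, 6721,   5,   56,   73],
    [   409,   23, 546,  123,  133],
    [  2423,  143,  52, 8387,  196],
    [   566,  239,  69,  668, 6604],
], dtype=float)

# Column-normalize to approximate P(y* = row | y = column).
p_true_given_obs = counts / counts.sum(axis=0, keepdims=True)

def denoise(posterior):
    """Estimate P(y*|x) = sum_y P(y*|y) P(y|x) from the English posterior P(y|x)."""
    return p_true_given_obs @ posterior

# Example: an English posterior that is fairly confident in PER.
posterior = np.array([0.001, 0.593, 0.000, 0.406, 0.000])
print({label: round(float(p), 3) for label, p in zip(LABELS, denoise(posterior))})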
4 Experiments

We conduct experiments on Chinese and German NER. We evaluate CLiPER in two learning settings: weakly supervised and semi-supervised. In the weakly supervised setting, we simulate the condition of having no labeled training data, and evaluate the model learned from bitext alone. We then vary the amount of labeled data available to the model, and examine the model's learning curve. In the semi-supervised setting, we assume our model has access to the full labeled data; our goal is to improve the performance of the supervised method by learning from additional bitext.

4.1 Dataset and setup

We used the latest version of the Stanford NER Toolkit7 as our base CRF model in all experiments. Features for the English, Chinese and German CRFs are documented extensively in (Che et al., 2013) and (Faruqui and Padó, 2010) and omitted here for brevity. It is worth noting that the current Stanford NER models include recent improvements from semi-supervised learning approaches that induce distributional similarity features from large word clusters. These models represent the current state-of-the-art in supervised methods, and serve as a very strong baseline.

For the Chinese NER experiments, we follow the same setup as Che et al. (2013) to evaluate on the latest OntoNotes (v4.0) corpus (Hovy et al., 2006).8 A total of 8,249 sentences from the parallel Chinese and English Penn Treebank portion9 are reserved for evaluation. Odd-numbered documents are used as the development set, and even-numbered documents are held out as the blind test set. The rest of OntoNotes annotated with NER tags is used to train the English and Chinese CRF base taggers. There are about 16k and 39k labeled sentences for Chinese and English training, respectively. The English CRF tagger trained on this training corpus gives an F1 score of 81.68% on the OntoNotes test set. Four entity types10 are used for both Chinese and English with an IO tagging scheme.11 The English-Chinese bitext comes from the Foreign Broadcast Information Service corpus (FBIS).12 We randomly sampled 80k parallel sentence pairs to use as bitext in our experiments. It is first sentence aligned using the Champollion Tool Kit,13 then word aligned with the BerkeleyAligner.14

7 http://www-nlp.stanford.edu/ner
8 LDC catalogue No.: LDC2011T03
9 File numbers: chtb 0001-0325, ectb 1001-1078
10 PERSON, LOCATION, ORGANIZATION and GPE.
11 We did not adopt the commonly seen BIO tagging scheme (Ramshaw and Marcus, 1999), because when projected across swapping word alignments, the "B-" and "I-" tag distinction may not be well-preserved and may introduce additional noise.
12 The FBIS corpus is a collection of radio news casts and contains translations of openly available news and information from media sources outside the United States. The LDC catalogue No. is LDC2003E14.
13 champollion.sourceforge.net
14 code.google.com/p/berkeleyaligner

For the German NER experiments, we evaluate using the standard CoNLL-03 NER corpus (Sang and Meulder, 2003). The labeled training set has 12k and 15k sentences, containing four entity types.15 An English CRF model is also trained on the CoNLL-03 English data with the same entity types. For bitext, we used a randomly sampled set of 40k parallel sentences from the de-en portion of the News Commentary dataset.16 The English CRF tagger trained on the CoNLL-03 English training corpus gives an F1 score of 90.4% on the CoNLL-03 test set.

15 PERSON, LOCATION, ORGANIZATION and MISCELLANEOUS.
16 http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz

We report typed entity precision (P), recall (R) and F1 score. Statistical significance tests are done using a paired bootstrap resampling method with 1000 iterations, averaged over 5 runs (a sketch of this test appears below). We compare against three recent approaches that were introduced in Section 2. They are: the semi-supervised learning method using factored bilingual models with Gibbs sampling (Wang et al., 2013a); bilingual NER using Integer Linear Programming (ILP) with bilingual constraints, by (Che et al., 2013); and the constraint-driven bilingual-reranking approach (Burkett et al., 2010). The code from (Che et al., 2013) and (Wang et al., 2013a) is publicly available.17 Code from (Burkett et al., 2010) was obtained through personal communications.

17 https://github.com/stanfordnlp/CoreNLP
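The paired bootstrap test mentioned above can be sketched as follows; this is our own illustration of the standard resampling procedure, not the authors' evaluation script, with entity scoring reduced to assumed per-sentence true-positive/false-positive/false-negative counts.

import random

def f1(counts):
    """Entity F1 from a list of per-sentence (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def paired_bootstrap(counts_a, counts_b, iterations=1000, seed=0):
    """Fraction of bootstrap resamples in which system A fails to beat system B;
    a small value indicates that A's advantage is statistically significant."""
    assert len(counts_a) == len(counts_b)
    rng = random.Random(seed)
    n = len(counts_a)
    losses = 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentences with replacement
        if f1([counts_a[i] for i in idx]) <= f1([counts_b[i] for i in idx]):
            losses += 1
    return losses / iterations

# Hypothetical usage, where each list holds per-sentence (tp, fp, fn) tuples:
# p = paired_bootstrap(counts_cliper, counts_crf, iterations=1000)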
Since the objective function in Eqn. 2 is non-convex, we adopted the early stopping training scheme from (Turian et al., 2010): after each iteration of L-BFGS training, the model is evaluated against the development set; the training procedure is terminated if no improvement has been made in 20 iterations.

4.2 Weakly supervised results

Figures 2a and 2b show the results of the weakly supervised learning experiments. Quite remarkably, on the Chinese test set, our proposed method (CLiPER) achieves an F1 score of 64.4% with 80k bitext when no labeled training data is used. In contrast, the supervised CRF baseline would require as much as 12k labeled sentences to attain the same accuracy. Results on the German test set are less striking. With no labeled data and 40k of bitext, CLiPER performs at an F1 of 60.0%, the equivalent of using 1.5k labeled examples in the supervised setting. When combined with 1k labeled examples, the performance of CLiPER reaches 69%, a gain of over 5% absolute over the supervised CRF. We also notice that the supervised CRF model learns much faster in German than in Chinese. This result is not too surprising, since it is well recognized that Chinese NER is more challenging than German or English NER. The best supervised results for Chinese are 10-20% (F1 score) behind the best German and English supervised results. Chinese NER relies more on lexicalized features, and therefore needs more labeled data to achieve good coverage. The results suggest that CLiPER is very effective at transferring lexical knowledge from English to Chinese.

Figures 2c and 2d compare soft GE projection with hard GE projection and the "project-then-train" style CRF training scheme (cf. Section 3.2). We observe that both soft and hard GE projection significantly outperform the "project-then-train" style training scheme. The difference is especially pronounced on the Chinese results when fewer labeled examples are available. Soft projection gives better accuracy than hard projection when no labeled data is available, and also has a faster learning rate.

Incorporating source-side noise using the method described in Section 3.3 gives a small improvement on Chinese when no labeled data is used, increasing the F1 score from 64.40% to 65.50%. This improvement is statistically significant at the 92% confidence level. However, on the German data, we observe a tiny, statistically insignificant decrease in F1 score, dropping from 59.88% to 59.66%. A likely explanation of the difference is that the English CRF model in the English-Chinese experiment, which is trained on OntoNotes data, has a much higher error rate (18.32%) than the English CRF model in the English-German experiment trained on CoNLL-03 (9.55%). Therefore, modeling noise in the English-Chinese case is likely to have a greater effect than in the English-German case.

4.3 Semi-supervised results

In the semi-supervised experiments, we let the CRF model use the full set of labeled examples in addition to the unlabeled bitext. Results on the test set are shown in Table 2. All semi-supervised baselines are tested with the same amount of unlabeled bitext as CLiPER in each language. The "project-then-train" semi-supervised training scheme severely hurts performance on Chinese, but gives a small improvement on German. Moreover, on Chinese it learns to achieve high precision but at a significant loss in recall. On German, its behavior is the opposite.
Such drastic and erratic imbalance suggest that this method is not robust or reliable. The other three semi-supervised baselines (row 3-5) all show im- provements over the CRF baseline, consistent with their reported results. CLIPERs gives the best re- sults on both Chinese and German, yielding statis- tically significant improvements over all baselines except for CWD13 on German. The hard projection version of CLiPER also gives sizable gain over CRF. However, in comparison, CLIPERs is superior. The improvements of CLIPERs over CRF on Chinese test set is over 2.8% in absolute F1. The improvement over CRF on German is almost a per- cent. To our knowledge, these are the best reported numbers on the OntoNotes Chinese and CoNLL-03 German datasets. 4.4 Efficiency Another advantage of our proposed approach is ef- ficiency. Because we eliminated the previous multi- stage “uptraining” paradigm, but instead integrating the semi-supervised and supervised objective into one joint objective, we are able to attain signifi- cant speed improvements over all methods except CRFptt. Table 3 shows the required training time. 62 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 10 20 30 40 50 60 70 80 # of labeled training sentences [k] F 1 sc or e [% ] supervised CRF CLiPPER soft (a) Chinese Test 0 1 2 3 4 5 6 7 8 9 10 11 12 0 10 20 30 40 50 60 70 80 # of labeled training sentences [k] F 1 sc or e [% ] supervised CRF CLiPPER soft (b) German Test 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 # of labeled training sentences [k] F 1 sc or e [% ] CRF projection CLiPPER hard CLiPPER soft (c) Soft vs. Hard on Chinese Test 0 1 2 3 4 5 6 7 8 9 10 11 12 54 56 58 60 62 64 66 68 70 72 74 76 78 80 # of labeled training sentences [k] F 1 sc or e [% ] CRF projection CLiPPER hard CLiPPER soft (d) Soft vs. Hard on German Test [高岗] 纪念碑 在 [横山] 落成 A monument commemorating [Vice President Gao GangPER ] was completed in [HengshanLOC ] (e) Word proceeding “monument” is PERSON [碛口] [毛主席] 东渡 [黄河] 纪念碑 简介 Introduction of [QikouLOC ] [Chairman MaoPER ] [Yellow RiverLOC ] crossing monument (f) Word proceeding “monument” is LOCATION Figure 2: Top four figures show performance curves of CLiPER with varying amounts of available labeled training data in a weakly supervised setting. Vertical axes show the F1 score on the test set. Performance curves of supervised CRF and “project-then-train” CRF are plotted for comparison. Bottom two figures are examples of aligned sentence pairs in Chinese and English. 63 Chinese German P R F1 P R F1 CRF 79.09 63.59 70.50 86.69 71.30 78.25 CRFptt 84.01 45.29 58.85 81.50 75.56 78.41 BPBK10 79.25 65.67 71.83 84.00 72.17 77.64 CWD13 81.31 65.50 72.55 85.99 72.98 78.95 WCD13a 80.31 65.78 72.33 85.98 72.37 78.59 WCD13b 78.55 66.54 72.05 85.19 72.98 78.62 CLiPERh 83.67 64.80 73.04§‡ 86.52 72.02 78.61∗ CLiPERs 82.57 65.99 73.35 §†? �∗ 87.11 72.56 79.17 ‡? ∗§ Table 2: Test set Chinese, German NER results. Best number of each column is highlighted in bold. CRF is the supervised baseline. CRFptt is the “project-then-train” semi-supervised scheme for CRF. BPBK10 is (Burkett et al., 2010), WCD13 is (Wang et al., 2013a), CWD13A is (Che et al., 2013), and WCD13B is (Wang et al., 2013b) . CLIPERs and CLIPERh are the soft and hard projections. § indicates F1 scores that are statistically significantly better than CRF baseline at 99.5% confidence level; ? 
5 Discussions

Figures 2e and 2f give two examples of cross-lingual projection methods in action. Both examples have a named entity that immediately precedes the word "纪念碑" (monument) in the Chinese sentence. In Figure 2e, the word "高岗" has the literal meaning of a hillock located at a high position, which also happens to be the name of a former vice president of China. Without having previously observed this word as a person name in the labeled training data, the CRF model does not have enough evidence to believe that this is a PERSON, instead of a LOCATION. But the aligned words in English ("Gao Gang") are clearly part of a person name as they were preceded by a title ("Vice President"). The English model has a high expectation that the aligned Chinese word of "Gao Gang" is also a PERSON. Therefore, projecting the English expectations to Chinese provides a strong clue to help disambiguate this word. Figure 2f gives another example: the word "黄河" (Huang He, the Yellow River of China) can be confused with a person name since "黄" (Huang or Hwang) is also a common Chinese last name.18 Again, knowing the translation in English, which has the indicative word "River" in it, helps disambiguation.

18 In fact, a people search of the name 黄河 on the most popular Chinese social network (renren.com) returns over 13,000 matches.

The CRFptt and CLiPERh methods successfully labeled these two examples correctly, but failed to produce the correct label for the example in Figure 1. On the other hand, a model trained with the CLiPERs method does correctly label both entities in Figure 1, demonstrating the merits of the soft projection method.

6 Conclusion

We introduced a domain- and language-independent semi-supervised method for training discriminative models by projecting expectations across bitext. Experiments on Chinese and German NER show that our method, learned over bitext alone, can rival the performance of supervised models trained with thousands of labeled examples. Furthermore, applying our method in a setting where all labeled examples are available also shows improvements over state-of-the-art supervised methods. Our experiments also showed that soft expectation projection is preferable to hard projection. This technique can be generalized to all sequence labeling tasks, and can be extended to include more complex constraints. For future work, we plan to apply this method to more language pairs and also explore data selection strategies and modeling of alignment uncertainties.

Acknowledgments

The authors would like to thank Jennifer Gillenwater for a discussion that inspired this work, Behrang Mohit and Nathan Schneider for their help with the Arabic NER data, and David Burkett for providing the source code of their work for comparison. We would also like to thank editor Lillian Lee and the three anonymous reviewers for their valuable comments and suggestions. We gratefully acknowledge the support of the U.S. Defense Advanced Research Projects Agency (DARPA) Broad Operational Language Translation (BOLT) program through IBM.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA or the US government.

References

Hiyan Alshawi, Srinivas Bangalore, and Shona Douglas. 2000. Head-transducer models for speech translation and their automatic acquisition from bilingual data. Machine Translation, 15.
Rie Kubota Ando and Tong Zhang. 2005. A high-performance semi-supervised learning method for text chunking. In Proceedings of ACL.
Kedar Bellare, Gregory Druck, and Andrew McCallum. 2009. Alternating projections for learning with expectation constraints. In Proceedings of UAI.
Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of COLT.
David Burkett and Dan Klein. 2008. Two languages are better than one (for syntactic parsing). In Proceedings of EMNLP.
David Burkett, Slav Petrov, John Blitzer, and Dan Klein. 2010. Learning better monolingual models with unannotated bilingual text. In Proceedings of CoNLL.
Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Coupled semi-supervised learning for information extraction. In Proceedings of WSDM.
Ming-Wei Chang, Lev Ratinov, and Dan Roth. 2007. Guiding semi-supervision with constraint-driven learning. In Proceedings of ACL.
Wanxiang Che, Mengqiu Wang, and Christopher D. Manning. 2013. Named entity recognition with bilingual constraints. In Proceedings of NAACL.
Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Proceedings of EMNLP.
Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of ACL.
Gregory Druck and Andrew McCallum. 2010. High-performance semi-supervised learning using discriminatively constrained generative models. In Proceedings of ICML.
Gregory Druck, Gideon Mann, and Andrew McCallum. 2007. Leveraging existing resources using generalized expectation criteria. In Proceedings of NIPS Workshop on Learning Problem Design.
Gregory Druck, Burr Settles, and Andrew McCallum. 2009. Active learning by labeling features. In Proceedings of EMNLP.
Gregory Druck. 2011. Generalized Expectation Criteria for Lightly Supervised Learning. Ph.D. thesis, University of Massachusetts Amherst.
Manaal Faruqui and Sebastian Padó. 2010. Training and evaluating a German named entity recognizer with semantic generalization. In Proceedings of KONVENS.
Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Ng. 2006. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In Proceedings of EMNLP.
Victoria Fossum and Steven Abney. 2005. Automatically inducing a part-of-speech tagger by projecting from multiple source languages across aligned corpora. In Proceedings of IJCNLP.
Kuzman Ganchev and Dipanjan Das. 2013. Cross-lingual discriminative learning of sequence models with posterior regularization. In Proceedings of EMNLP.
Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2009. Dependency grammar induction via bitext projection constraints. In Proceedings of ACL.
Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. JMLR, 11:2001–2049.
Andrew B. Goldberg. 2010. New Directions in Semi-supervised Learning. Ph.D. thesis, University of Wisconsin-Madison.
Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of NAACL-HLT.
Dan Klein. 2005. The Unsupervised Learning of Natural Language Structure. Ph.D. thesis, Stanford University.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML.
Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of EMNLP.
Shen Li, João Graça, and Ben Taskar. 2012. Wiki-ly supervised part-of-speech tagging. In Proceedings of EMNLP-CoNLL.
Percy Liang. 2005. Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology.
Gideon Mann and Andrew McCallum. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. JMLR, 11:955–984.
David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of NAACL-HLT.
Tahira Naseem, Benjamin Snyder, Jacob Eisenstein, and Regina Barzilay. 2009. Multilingual part-of-speech tagging: Two unsupervised approaches. JAIR, 36.
Slav Petrov, Pi-Chuan Chang, Michael Ringgaard, and Hiyan Alshawi. 2010. Uptraining for accurate deterministic question parsing. In Proceedings of EMNLP.
Lance A. Ramshaw and Mitchell P. Marcus. 1999. Text chunking using transformation-based learning. Natural Language Processing Using Very Large Corpora, 11:157–176.
Rajhans Samdani, Ming-Wei Chang, and Dan Roth. 2012. Unified expectation maximization. In Proceedings of NAACL.
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of CoNLL.
Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. 2005. A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of the ICML Workshop on Learning with Multiple Views.
Noah A. Smith. 2006. Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text. Ph.D. thesis, Johns Hopkins University.
Benjamin Snyder, Tahira Naseem, and Regina Barzilay. 2009. Unsupervised multilingual grammar induction. In Proceedings of ACL.
Jun Suzuki and Hideki Isozaki. 2008. Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In Proceedings of ACL.
Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. In Proceedings of ACL.
Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL.
Mengqiu Wang, Wanxiang Che, and Christopher D. Manning. 2013a. Effective bilingual constraints for semi-supervised learning of named entity recognizers. In Proceedings of AAAI.
Mengqiu Wang, Wanxiang Che, and Christopher D. Manning. 2013b. Joint word alignment and bilingual named entity recognition using dual decomposition. In Proceedings of ACL.
Chenhai Xi and Rebecca Hwa. 2005. A backoff model for bootstrapping resources for non-English languages. In Proceedings of HLT-EMNLP.
David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of NAACL.
David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of ACL.