Adapting to All Domains at Once: Rewarding Domain Invariance in SMT

Hoang Cuong and Khalil Sima’an and Ivan Titov
Institute for Logic, Language and Computation
University of Amsterdam
Science Park 107, 1098 XG Amsterdam, The Netherlands
{c.hoang,k.simaan,titov}@uva.nl

Abstract

Existing work on domain adaptation for statistical machine translation has consistently assumed access to a small sample from the test distribution (target domain) at training time. In practice, however, the target domain may not be known at training time or it may change to match user needs. In such situations, it is natural to push the system to make safer choices, giving higher preference to domain-invariant translations, which work well across domains, over risky domain-specific alternatives. We encode this intuition by (1) inducing latent subdomains from the training data only; (2) introducing features which measure how specialized phrases are to individual induced subdomains; (3) estimating feature weights on out-of-domain data (rather than on the target domain). We conduct experiments on three language pairs and a number of different domains. We observe consistent improvements over a baseline which does not explicitly reward domain invariance.

1 Introduction

Mismatch in phrase translation distributions between test data (target domain) and training data is known to harm the performance of statistical translation systems (Irvine et al., 2013; Carpuat et al., 2014). Domain-adaptation methods (Foster et al., 2010; Bisazza et al., 2011; Sennrich, 2012b; Razmara et al., 2012; Sennrich et al., 2013; Haddow, 2013; Joty et al., 2015) aim to specialize a system estimated on out-of-domain training data to a target domain represented by a small data sample. In practice, however, the target domain may not be known at training time, or it may change over time depending on user needs. In this work we address exactly the setting where we have a domain-agnostic system but no access to any samples from the target domain at training time.
This is an important and challenging setting which, as far as we are aware, has not yet received attention in the literature.

When the target domain is unknown at training time, the system could be trained to make safer choices, preferring translations which are likely to work across different domains. For example, when translating from English to Russian, the most natural translation for the word ‘code’ would be highly dependent on the domain (and the corresponding word sense). The Russian words ‘xifr’, ‘zakon’ or ‘programma’ would perhaps be optimal choices if we consider the cryptography, legal and software development domains, respectively. However, the translation ‘kod’ is also acceptable across all these domains and, as such, would be a safer choice when the target domain is unknown. Note that such a translation may not be the most frequent overall and, consequently, might not be proposed by a standard (i.e., domain-agnostic) phrase-based translation system.

In order to encode a preference for domain-invariant translations, we introduce a measure which quantifies how likely a phrase (or a phrase pair) is to be “domain-invariant”. We recall that most large parallel corpora are heterogeneous, consisting of diverse language use originating from a variety of unspecified subdomains. For example, news articles may cover sports, finance, politics, technology and a variety of other news topics. None of the subdomains may match the target domain particularly well, but they can still reveal how domain-specific a given phrase is. For example, if we observe that the word ‘code’ can be translated as ‘kod’ across the cryptography and legal subdomains observed in the training data, we can hypothesize that it may work better on a new unknown domain than ‘zakon’, which was specific only to a single subdomain (legal). This would be a suitable decision if the test domain happens to be software development, even though no texts pertaining to this domain were included in the heterogeneous training data.

Importantly, the subdomains are usually not specified in the heterogeneous training data. Therefore, we treat the subdomains as latent, so we can induce them automatically. Once induced, we define measures of domain specificity, particularly expressing two generic properties:

Phrase domain specificity: How specific is a target or a source phrase to some of the induced subdomains?

Phrase pair domain coherence: How coherently are a source phrase and a target language translation used across the induced subdomains?

These features capture two orthogonal aspects of phrase behaviour in heterogeneous corpora, with the rationale that phrase pairs can be weighted along these two dimensions. Domain specificity captures the intuition that the more specific a phrase is to certain subdomains, the less applicable it is in general. Note that specificity is applied not only to target phrases (such as ‘kod’ and ‘zakon’ in the above example) but also to source phrases. When applied to a source phrase, it may give a preference towards using shorter phrases, as they are inherently less domain-specific.
In contrast to phrase domain specificity, phrase pair coherence reflects whether candidate target and source phrases are typically used in the same set of domains. The intuition here is that the more divergent the distributional behaviour of source and target phrases across subdomains, the less certain we are whether this phrase pair is valid for the unknown target domain. In other words, a translation rule whose source and target phrases have two similar distributions over the latent subdomains is likely safer to use.

Weights for these features, alongside all other standard features, are tuned on a development set. Importantly, we show that there is no noteworthy benefit from tuning the weights on a sample from the target domain. It is enough to tune them on a mixed-domain dataset sufficiently different from the training data. We attribute this attractive property to the fact that our features, unlike the ones typically considered in standard domain-adaptation work, are generic and only affect the amount of risk our system takes. In contrast, for example, in Eidelman et al. (2012), Chiang et al. (2011), Hu et al. (2014), Hasler et al. (2014), Su et al. (2015), Sennrich (2012b), Chen et al. (2013b), and Carpuat et al. (2014), features capture similarities between a target domain and each of the training subdomains. Clearly, domain adaptation with such rich features, though potentially more powerful, would not be possible without a development set closely matching the target domain.

We conduct our experiments on three language pairs and explore 9 domain adaptation tasks in total. We observe significant and consistent performance improvements over the baseline domain-agnostic systems. This result confirms that our two features, and the latent subdomains they are computed from, are useful also for the very challenging domain adaptation setting considered in this work.

2 Domain-Invariance for Phrases

At the core of a standard state-of-the-art phrase-based system (Koehn et al., 2003; Och and Ney, 2004) lies a phrase table {⟨ẽ, f̃⟩} extracted from a word-aligned training corpus, together with estimates for the phrase translation probabilities Pcount(ẽ | f̃) and Pcount(f̃ | ẽ). Typically the phrases and their probabilities are obtained from large parallel corpora, which are usually broad enough to cover a mixture of several subdomains. In such mixtures, phrase distributions may differ across subdomains. Some phrases (whether source or target) are more specific to certain subdomains than others, while some phrases are useful across many subdomains. Moreover, for a phrase pair, the distribution over the subdomains for its source side may or may not be similar to the distribution for its target side.

[Figure 1: The projection framework of phrases into a K-dimensional vector space of probabilistic latent subdomains.]

Coherent pairs seem safer to employ than pairs that exhibit different distributions over the subdomains. These two factors, domain specificity and domain coherence, can be estimated from the training corpus if we have access to subdomain statistics for the phrases. In the setting addressed here, the subdomains are not known in advance and we have to consider them latent in the training data.

Therefore, we introduce a random variable
z ∈ {1, . . . , K} encoding K (arbitrary) latent subdomains that generate each source and target phrase ẽ and f̃ of every phrase pair ⟨ẽ, f̃⟩. In the next section, we aim to estimate the distributions P(z | ẽ) and P(z | f̃) over subdomains z for the source and target phrases respectively. In other words, we aim at projecting phrases onto a compact (K−1)-dimensional simplex of subdomains with vectors:

\begin{align}
\vec{\tilde{e}} &= \big[\, P(z=1 \mid \tilde{e}), \dots, P(z=K \mid \tilde{e}) \,\big], \tag{1}\\
\vec{\tilde{f}} &= \big[\, P(z=1 \mid \tilde{f}), \dots, P(z=K \mid \tilde{f}) \,\big]. \tag{2}
\end{align}

Each of the K elements encodes how well each source and target phrase expresses a specific latent subdomain in the training data. See Figure 1 for an illustration of the projection framework. Once the projection is performed, the hidden cross-domain translation behaviour of phrases and phrase pairs can be modeled as follows:

• Domain specificity of phrases: A rule with source and target phrases having a peaked distribution over latent subdomains is likely domain-specific. Technically speaking, entropy comes as a natural choice for quantifying domain specificity. Here, we opt for the Renyi entropy and define the domain specificity as follows:

\begin{align*}
D_\alpha(\vec{\tilde{e}}) &= \frac{1}{1-\alpha} \log \Big( \sum_{i=1}^{K} P(z=i \mid \tilde{e})^\alpha \Big),\\
D_\alpha(\vec{\tilde{f}}) &= \frac{1}{1-\alpha} \log \Big( \sum_{i=1}^{K} P(z=i \mid \tilde{f})^\alpha \Big).
\end{align*}

For convenience, we refer to Dα(·) as the domain specificity of a phrase. In this study, we set α to 2, the default choice (also known as the collision entropy).

• Source-target coherence across subdomains: A translation rule with source and target phrases having two similar distributions over the latent subdomains is likely safer to use. We use the Chebyshev distance for measuring the similarity between two distributions. The divergence of two vectors ẽ and f̃ is defined as follows:

\[
D(\vec{\tilde{e}}, \vec{\tilde{f}}) = \max_{i \in \{1,\dots,K\}} \big| P(z=i \mid \tilde{e}) - P(z=i \mid \tilde{f}) \big|.
\]

We refer to D(ẽ, f̃) as the phrase pair coherence across latent subdomains.

We investigated some other similarity measures for phrase pair coherence (the Kullback-Leibler divergence and the Hellinger distance) but have not observed any noticeable improvements in performance. We will discuss these experiments in the empirical section. Once computed for every phrase pair, the three measures Dα(ẽ), Dα(f̃) and D(ẽ, f̃) will be integrated into a phrase-based SMT system as feature functions.

3 Latent Subdomain Induction

We now present our approach for inducing the latent subdomain distributions P(z | ẽ) and P(z | f̃) for every source and target phrase ẽ and f̃. In our experiments, we compare our subdomain induction framework with relying on topic distributions provided by a standard topic model, Latent Dirichlet Allocation (Blei et al., 2003). Note that, unlike LDA, we rely on parallel data and word alignments when inducing domains. Our intuition is that latent variables capturing regularities in bilingual data may be more appropriate for the translation task.

Inducing these probabilities directly is rather difficult, as the task of designing a fully generative phrase-based model is known to be challenging.¹ In order to avoid this, we follow Matsoukas et al. (2009) and Cuong and Sima’an (2014a), who “embed” such a phrase-level model into a latent subdomain model that works at the sentence level. In other words, we associate latent domains with sentence pairs rather than with phrases, and use the posterior probabilities computed for the sentences with all the phrases appearing in the corresponding sentences.

¹ Doing that requires incorporating into the model additional hidden variables encoding phrase segmentation (DeNero et al., 2006). This would significantly complicate inference (Mylonakis and Sima’an, 2008; Neubig et al., 2011; Cohn and Haffari, 2013).
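To make this pipeline concrete, here is a minimal Python sketch of how sentence-level domain posteriors can be aggregated into the phrase-level vectors of Equations (1)-(2), and how the two feature functions of Section 2 are then computed; the estimation of the sentence posteriors P(z | e, f) themselves is described next. The data layout (lists of per-sentence posterior vectors paired with per-sentence phrase-count dicts) is a hypothetical format chosen for illustration, not the paper's implementation.

```python
import math
from collections import defaultdict

def phrase_posteriors(sentence_posteriors, phrase_counts, K):
    """Aggregate sentence-level posteriors P(z | e, f) into phrase-level
    distributions P(z | phrase), weighting by phrase counts c(phrase; e)."""
    totals = defaultdict(lambda: [0.0] * K)
    for post, counts in zip(sentence_posteriors, phrase_counts):
        for phrase, c in counts.items():
            for i in range(K):
                totals[phrase][i] += post[i] * c
    # Normalize each accumulated vector into a distribution over domains.
    return {p: [v / sum(vec) for v in vec] for p, vec in totals.items()}

def domain_specificity(p, alpha=2.0):
    """Renyi entropy D_alpha of a domain distribution; alpha=2 is the
    collision entropy used in the paper (lower value = more peaked,
    i.e., more domain-specific)."""
    return (1.0 / (1.0 - alpha)) * math.log(sum(x ** alpha for x in p))

def pair_coherence(p_e, p_f):
    """Chebyshev distance between the source- and target-phrase domain
    vectors; smaller = more coherent (safer) phrase pair."""
    return max(abs(a - b) for a, b in zip(p_e, p_f))
```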
Given P(z | e, f), a latent subdomain model for sentence pairs ⟨e, f⟩, the estimation of P(z | ẽ) and P(z | f̃) for phrases ẽ and f̃ reduces to computing, for all z ∈ {1, . . . , K}:

\begin{align*}
P(z=i \mid \tilde{e}) &= \frac{\sum_{e,f} P(z=i \mid e, f)\, c(\tilde{e}; e)}{\sum_{i'=1}^{K} \sum_{e,f} P(z=i' \mid e, f)\, c(\tilde{e}; e)},\\
P(z=i \mid \tilde{f}) &= \frac{\sum_{e,f} P(z=i \mid e, f)\, c(\tilde{f}; f)}{\sum_{i'=1}^{K} \sum_{e,f} P(z=i' \mid e, f)\, c(\tilde{f}; f)}.
\end{align*}

Here, c(ẽ; e) is the count of a phrase ẽ in a sentence e in the training corpus.

Latent subdomains for sentences. We now turn to describing our latent subdomain model for sentences. We assume the following generative story for sentence pairs:

1. generate the domain z from the prior P(z);
2. choose the generation direction: f-to-e or e-to-f, with equal probability;
3. if the e-to-f direction is chosen, then generate the pair relying on P(e | z) P(f | e, z);
4. otherwise, use P(f | z) P(e | f, z).

Formally, it is a uniform mixture of the generative processes for the two potential translation directions.² This generative story implies having two translation models (TMs) and two language models (LMs), each augmented with latent subdomains. Now, the posterior P(z | e, f) can be computed as

\[
P(z \mid e, f) \propto P(z) \Big( \tfrac{1}{2} P(e \mid z) P(f \mid e, z) + \tfrac{1}{2} P(f \mid z) P(e \mid f, z) \Big). \tag{3}
\]

² Note that we effectively average between them, which is reasonable, as there is no reason to give preference to either one.

As we aim for a simple approach, our TMs are computed through the introduction of hidden alignments a and a′ in the f-to-e and e-to-f directions respectively, in which P(f | e, z) = Σ_a P(f, a | e, z) and P(e | f, z) = Σ_{a′} P(e, a′ | f, z). To make the marginalization over alignments tractable, we restrict P(f, a | e, z) and P(e, a′ | f, z) to the same assumptions as IBM Model 1 (Brown et al., 1993) (i.e., a product of lexical translation probabilities with respect to latent subdomains).

We use standard nth-order Markov models for P(e | z) and P(f | z), in which

\[
P(e \mid z) = \prod_i P(e_i \mid e_{i-n}^{i-1}, z) \quad \text{and} \quad P(f \mid z) = \prod_j P(f_j \mid f_{j-n}^{j-1}, z).
\]

Here, the notation e_{i−n}^{i−1} and f_{j−n}^{j−1} denotes the history of length n for the source and target words e_i and f_j, respectively.

Training. For training, we maximize the log-likelihood L of the data:

\[
L = \sum_{e,f} \log \Big( \sum_z P(z) \Big( \tfrac{1}{2} P(e \mid z) \sum_a P(f, a \mid e, z) + \tfrac{1}{2} P(f \mid z) \sum_{a'} P(e, a' \mid f, z) \Big) \Big). \tag{4}
\]

As there is no closed-form solution, we use the expectation-maximization (EM) algorithm (Dempster et al., 1977). In the E-step, we compute the posterior distributions P(a, z | e, f) and P(a′, z | e, f) as follows:

\begin{align}
P(a, z \mid e, f) &\propto P(z) \Big( P(e \mid z) P(f, a \mid e, z) + P(f \mid z) \sum_{a'} P(e, a' \mid f, z) \Big), \tag{5}\\
P(a', z \mid e, f) &\propto P(z) \Big( P(e \mid z) \sum_{a} P(f, a \mid e, z) + P(f \mid z) P(e, a' \mid f, z) \Big). \tag{6}
\end{align}

In the M-step, we use the posteriors P(a, z | e, f) and P(a′, z | e, f) to re-estimate the parameters of both alignment models. This is done in a very similar way to the estimation of the standard IBM Model 1. We use the posteriors to re-estimate the LM parameters as follows:

\begin{align}
P(e_i \mid e_1^{i-1}, z) &\propto \sum_{e,f} P(z \mid e, f)\, c(e_1^{i}; e), \tag{7}\\
P(f_i \mid f_1^{i-1}, z) &\propto \sum_{e,f} P(z \mid e, f)\, c(f_1^{i}; f). \tag{8}
\end{align}
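As an illustration of the sentence-level E-step, the following sketch evaluates the posterior of Equation (3) in log space. IBM Model 1 lets the alignment sum be computed in closed form; for brevity, unigram tables stand in for the paper's n-gram LMs, and the `params` container with its field names is a hypothetical layout.

```python
import math

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ibm1_loglik(f_sent, e_sent, t):
    """log P(f | e, z) under IBM Model 1: the alignment sum factorizes
    as prod_j (1/(l+1)) * sum_i t(f_j | e_i)."""
    e_null = ["<NULL>"] + e_sent
    return sum(math.log(sum(t.get((fw, ew), 1e-12) for ew in e_null) / len(e_null))
               for fw in f_sent)

def domain_posterior(e_sent, f_sent, params, K):
    """P(z | e, f) of Eq. (3); unigram LMs replace the paper's n-gram LMs."""
    log_scores = []
    for z in range(K):
        lm_e = sum(math.log(params.lm_e[z].get(w, 1e-12)) for w in e_sent)
        lm_f = sum(math.log(params.lm_f[z].get(w, 1e-12)) for w in f_sent)
        ef = lm_e + ibm1_loglik(f_sent, e_sent, params.t_f_given_e[z])  # e-to-f
        fe = lm_f + ibm1_loglik(e_sent, f_sent, params.t_e_given_f[z])  # f-to-e
        log_scores.append(math.log(params.prior[z]) + math.log(0.5)
                          + logaddexp(ef, fe))
    m = max(log_scores)
    weights = [math.exp(s - m) for s in log_scores]
    return [w / sum(weights) for w in weights]
```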
To obtain better parameter estimates for word predictions and avoid overfitting, we use smoothing in the M-step. In this work, we chose to apply the expected Kneser-Ney smoothing technique (Zhang and Chiang, 2014), as it is simple and achieves state-of-the-art performance on the language modeling problem. Finally, P(z) can be simply estimated as follows:

\[
P(z) \propto \sum_{e,f} P(z \mid e, f). \tag{9}
\]

Hierarchical Training. In practice, we found that training the full joint model leads to brittle performance, as EM is very likely to get stuck in bad local maxima. To address this difficulty, in our implementation, we start out by first jointly training P(z), P(e | z) and P(f | z). In this case, in the E-step, we fix the model parameters and compute P(z | e, f) for every sentence pair: P(z | e, f) ∝ P(e | z) P(f | z) P(z). In the M-step, we use the posteriors to re-estimate the model parameters, as in Equations (7), (8) and (9). Once this model is trained, we fix the language modeling parameters and finally train the full model. This parallel latent subdomain language model is less expressive and, consequently, is less likely to get stuck in a local maximum. The LMs estimated in this way will then drive the full alignment model towards better configurations in the parameter space.³ In practice, this training scheme is particularly useful in the case of learning a more fine-grained latent subdomain model with a larger K.

³ This procedure can be regarded as a form of hierarchical estimation: we start with a simpler model and then use it to drive a more expressive model. Note that we also use P(z) estimated within the parallel latent subdomain LMs to initialize P(z) for the latent subdomain alignment model.

4 Experiments

4.1 Data

We conduct experiments with large-scale SMT systems across a number of domains for three language pairs (English-Spanish, English-German and English-French). The datasets are summarized in Table 1.

Table 1: Data preparation.

  Training data     Sents    Words (English / other)
  English-French    5.01M    103.39M / 125.81M
  English-Spanish   4.00M     81.48M /  89.08M
  English-German    4.07M     93.19M /  88.48M

For English-Spanish, we run experiments with training data consisting of 4M sentence pairs collected from multiple resources within the WMT 2013 MT Shared Task. These include EuroParl (Koehn, 2005), the Common Crawl Corpus, the UN Corpus, and News Commentary. For English-German, our training data consists of 4.1M sentence pairs collected from the WMT 2015 MT Shared Task, including EuroParl, the Common Crawl Corpus and News Commentary. Finally, for English-French, we train SMT systems on a corpus of 5M sentence pairs collected from the WMT 2015 MT Shared Task, including the 10⁹ French-English corpus.

We conducted experiments on 9 different domains (tasks), where the data was manually collected by TAUS (https://www.taus.net/). Table 2 presents the translation tasks: each task deals with a specific domain, and each task presumably has a very different degree of relevance
to the training data. In this way, we test the stability of our results across a wide range of target domains.

Table 2: Data and adaptation tasks.

  English-French
    Professional & Business Services   Dev: 2K sents, 74.16K/83.85K words     Test: 5K sents, 92.84K/105.05K words
    Leisure, Tourism and Arts          Dev: 2K sents, 107.45K/117.16K words   Test: 5K sents, 101.82K/114.76K words
  English-Spanish
    Professional & Business Services   Dev: 2K sents, 31.70K/34.62K words     Test: 5K sents, 84.1K/93.4K words
    Legal                              Dev: 2K sents, 35.06K/38.78K words     Test: 5K sents, 88.63K/102.71K words
    Financials                         Dev: 2K sents, 37.23K/42.89K words     Test: 5K sents, 99.05K/109.81K words
  English-German
    Professional & Business Services   Dev: 2K sents, 80.49K/85.08K words     Test: 5K sents, 79.75K/85.28K words
    Legal                              Dev: 2K sents, 50.54K/45.99K words     Test: 5K sents, 124.93K/111.70K words
    Computer Software                  Dev: 2K sents, 40.24K/38.31K words     Test: 5K sents, 102.71K/101.12K words
    Computer Hardware                  Dev: 2K sents, 37.40K/36.98K words     Test: 5K sents, 103.29K/98.04K words

4.2 Systems

We use a standard state-of-the-art phrase-based system. The Baseline system includes the MOSES (Koehn et al., 2007) baseline feature functions, plus eight hierarchical lexicalized reordering model feature functions (Galley and Manning, 2008). The training data is first word-aligned using GIZA++ (Och and Ney, 2003) and then symmetrized with grow(-diag)-final-and (Koehn et al., 2003). We limit the phrase length to a maximum of seven words. The language models are interpolated 5-grams with Kneser-Ney smoothing, estimated with KenLM (Heafield et al., 2013) from a large monolingual corpus of nearly 2.1B English words collected within the WMT 2015 MT Shared Task. Finally, we use MOSES as the decoder (Koehn et al., 2007).

Our system is exactly the same as the baseline, plus three additional feature functions induced for the translation rules: two features for the domain specificity of phrases (for the source side, Dα(f̃), and for the target side, Dα(ẽ)), and one feature for source-target coherence across subdomains, D(ẽ, f̃). For the projection, we use K = 12. We also explored different values of K, but have not observed significant differences in the scores. In our experiments we do one iteration of EM with parallel LMs (as described in Section 3), before continuing with the full model for three more iterations. We did not observe a significant improvement from running EM any longer. Finally, we use hard EM, as it has been found to yield better models than standard soft EM on a number of different tasks (e.g., Johnson (2007)). In other words, instead of standard ‘soft’ EM updates with phrase counts weighted according to the posterior P(z = i | e, f), we use the ‘winner-takes-all’ approach:

\begin{align*}
P(z=i \mid \tilde{e}) &\propto \sum_{\langle e,f \rangle} c(i; \hat{z}_{\langle e,f \rangle})\, \delta(\tilde{e}; e),\\
P(z=i \mid \tilde{f}) &\propto \sum_{\langle e,f \rangle} c(i; \hat{z}_{\langle e,f \rangle})\, \delta(\tilde{f}; f).
\end{align*}

Here, ẑ⟨e,f⟩ is the “winning” latent subdomain for the sentence pair ⟨e, f⟩:

\[
\hat{z}_{\langle e,f \rangle} = \operatorname*{argmax}_{i \in \{1,\dots,K\}} P(z=i \mid e, f).
\]

In practice, we found that using this hard version leads to better performance.⁵

⁵ A more principled alternative would be to use posterior regularization (Ganchev et al., 2009).
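For concreteness, a sketch of the winner-takes-all update, mirroring the soft aggregation sketched in Section 3 (same hypothetical data layout): each sentence pair credits all of its phrase counts to its single most probable domain.

```python
from collections import defaultdict

def hard_phrase_posteriors(sentence_posteriors, phrase_counts, K):
    """'Winner-takes-all' aggregation: credit each sentence pair's phrase
    counts entirely to its argmax domain z-hat, then normalize per phrase."""
    totals = defaultdict(lambda: [0.0] * K)
    for post, counts in zip(sentence_posteriors, phrase_counts):
        z_hat = max(range(K), key=lambda i: post[i])  # winning subdomain
        for phrase, c in counts.items():
            totals[phrase][z_hat] += c
    return {p: [v / sum(vec) for v in vec] for p, vec in totals.items()}
```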
4.3 Alternative tuning scenarios

In order to tune all systems, we use k-best batch MIRA (Cherry and Foster, 2012). We report translation accuracy with three metrics: BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2011) and TER (Snover et al., 2006). We mark an improvement as significant when we obtain a p-level of 5% under paired bootstrap resampling (Koehn, 2004). Note that better results correspond to larger BLEU and METEOR but to smaller TER. For every system reported, we run the optimizer at least three times, before running MultEval (Clark et al., 2011) for resampling and significance testing. Note that the scores for the systems are averages over multiple runs.

Table 3: Adaptation results when tuning on the in-domain development set. Bold face indicates that the improvement over the baseline is significant.

  Task                               System      BLEU↑/∆     METEOR↑/∆   TER↓/∆
  English-French
  Professional & Business Services   Baseline    21.4        28.8        60.0
                                     Our System  21.5/+0.1   28.9/+0.1   59.7/-0.3
  Leisure, Tourism and Arts          Baseline    39.9        36.7        48.1
                                     Our System  40.8/+0.9   37.1/+0.4   47.1/-1.0
  English-Spanish
  Financials                         Baseline    32.5        37.1        45.6
                                     Our System  32.8/+0.3   37.2/+0.1   45.4/-0.2
  Professional & Business Services   Baseline    24.4        31.7        54.9
                                     Our System  24.8/+0.4   31.9/+0.2   54.8/-0.1
  Legal Services                     Baseline    33.3        36.3        49.5
                                     Our System  33.8/+0.5   36.5/+0.2   49.1/-0.4
  English-German
  Computer Software                  Baseline    22.8        27.7        64.3
                                     Our System  23.1/+0.3   27.8/+0.1   64.0/-0.3
  Computer Hardware                  Baseline    20.5        27.7        61.2
                                     Our System  20.9/+0.4   27.9/+0.2   61.1/-0.1
  Professional & Business Services   Baseline    15.3        25.4        69.2
                                     Our System  15.7/+0.4   25.6/+0.2   68.6/-0.6
  Legal Services                     Baseline    29.6        32.9        55.6
                                     Our System  30.2/+0.6   33.3/+0.4   55.1/-0.5

Table 4: Adaptation results when tuning on the mixed-domain development set. Bold face indicates that the improvement over the baseline is significant.

  Task                               System      BLEU↑/∆     METEOR↑/∆   TER↓/∆
  English-French
  Professional & Business Services   Baseline    20.7        28.3        59.5
                                     Our System  20.7/+0.0   28.4/+0.1   59.4/-0.1
  Leisure, Tourism and Arts          Baseline    39.7        37.0        48.6
                                     Our System  40.6/+0.9   37.4/+0.4   47.4/-1.2
  English-Spanish
  Financials                         Baseline    33.6        37.5        45.4
                                     Our System  34.0/+0.4   37.7/+0.2   45.0/-0.4
  Professional & Business Services   Baseline    24.4        31.9        55.3
                                     Our System  24.9/+0.5   32.0/+0.1   54.9/-0.4
  Legal Services                     Baseline    32.4        35.8        49.0
                                     Our System  32.9/+0.5   36.0/+0.2   48.8/-0.2
  English-German
  Computer Software                  Baseline    23.2        27.6        63.4
                                     Our System  23.5/+0.3   27.8/+0.2   63.0/-0.4
  Computer Hardware                  Baseline    20.8        27.8        61.5
                                     Our System  21.0/+0.2   28.0/+0.2   61.2/-0.3
  Professional & Business Services   Baseline    13.8        25.2        72.2
                                     Our System  13.9/+0.1   25.3/+0.1   72.1/-0.1
  Legal Services                     Baseline    29.3        32.7        55.2
                                     Our System  29.9/+0.6   33.1/+0.4   54.6/-0.6

For tuning the systems we explore two kinds of development sets: (1) an in-domain development set of in-domain data that directly exemplifies the translation task (i.e., a sample of target-domain data), and (2) a mixed-domain development set, which is a full concatenation of the development sets from all the available domains for a language pair; this scenario is the more realistic one when no in-domain data is available. In the analysis section we also test these two scenarios against the scenario mixed-domain minus in-domain, which excludes the in-domain development set from the mixed-domain development set. By exploring the three different development sets we hope to shed light on the importance of having samples from the target domain when using our features.
If our features can indeed capture the domain invariance of phrases, then they should improve performance in all three settings, including the most difficult setting where the in-domain data has been explicitly excluded from the tuning phase.

4.4 Main results

In-domain tuning scenario. Table 3 presents the results for the in-domain development set scenario. The integration of the domain-invariant feature functions into the baseline results in a significant improvement across all domains: on average +0.50 BLEU on two adaptation tasks for English-French, +0.40 BLEU on three adaptation tasks for English-Spanish and +0.43 BLEU on four adaptation tasks for English-German.

Mixed-domain tuning scenario. While the improvements are robust and consistent for the in-domain development set scenario, we are especially delighted to see a similar improvement for the mixed-domain tuning scenario (Table 4). In detail, we observe an average +0.45 BLEU on two adaptation tasks for English-French, +0.47 BLEU on three adaptation tasks for English-Spanish and +0.30 BLEU on four adaptation tasks for English-German. We would like to emphasize that this performance improvement is obtained without tuning specifically for the target domain or using other domain-related meta-information in the training corpus.

Additional analysis

We investigate the individual contribution of each domain-invariance feature. We conduct experiments using a basic large-scale phrase-based system described in Koehn et al. (2003) as a baseline. The baseline includes two bi-directional phrase-based models (Pcount(ẽ | f̃) and Pcount(f̃ | ẽ)), three penalties for word, phrase and distortion, and finally, the language model. On top of the baseline, we build four different systems, each augmented with a domain-invariance feature. The first feature is the source-target coherence feature, D(ẽ, f̃), where we use the Chebyshev distance as our default option. We also investigate the performance of other metrics, including the Hellinger distance⁶ and the Kullback-Leibler divergence.⁷ Our second and third features are the domain specificity of phrases on the source (Dα(f̃)) and on the target (Dα(ẽ)) sides. Finally, we also deploy all three domain-invariance features (Dα(f̃) + Dα(ẽ) + D(ẽ, f̃)). The experiments are conducted for the Legal task on English-German.

⁶ $D_H(\vec{\tilde{e}}, \vec{\tilde{f}}) = \frac{1}{\sqrt{2}} \sqrt{\sum_z \big( \sqrt{P(z \mid \tilde{e})} - \sqrt{P(z \mid \tilde{f})} \big)^2}$.
⁷ $D_{KL}(\vec{\tilde{e}}, \vec{\tilde{f}}) = \sum_z P(z \mid \tilde{e}) \log \frac{P(z \mid \tilde{e})}{P(z \mid \tilde{f})}$; $D_{KL}(\vec{\tilde{f}}, \vec{\tilde{e}}) = \sum_z P(z \mid \tilde{f}) \log \frac{P(z \mid \tilde{f})}{P(z \mid \tilde{e})}$.

Table 5: Improvements over the baseline. Bold face indicates that the difference is statistically significant.

  English-German (Task: Legal)
  Dev               System                          BLEU↑
  In-domain         Baseline                        28.8
                    +D(ẽ, f̃)                        29.1/+0.3
                    +Dα(ẽ)                          29.4/+0.6
                    +Dα(f̃)                          29.8/+1.0
                    +Dα(f̃) + Dα(ẽ) + D(ẽ, f̃)       29.9/+1.1
  Mixed-domains     Baseline                        28.5
                    +D(ẽ, f̃)                        28.8/+0.3
                    +Dα(ẽ)                          29.3/+0.8
                    +Dα(f̃)                          29.6/+1.1
                    +Dα(f̃) + Dα(ẽ) + D(ẽ, f̃)       29.8/+1.3
  Mixed-domains     Baseline                        28.3
  (Exclude Legal)   +D(ẽ, f̃)                        28.6/+0.3
                    +Dα(ẽ)                          29.1/+0.8
                    +Dα(f̃)                          29.5/+1.2
                    +Dα(f̃) + Dα(ẽ) + D(ẽ, f̃)       29.6/+1.3

  German-English (Task: Legal Services)

  Input:      im jahr 2004 befindet der rat über die verpflichtung der elektronischen übertragung solcher aufzeichnungen.
  Reference:  the council shall decide in 2004 on the obligation to transmit such records electronically.
  Baseline:   in 2004 the council is the obligation on the electronic transfer of such records.
  + Dα(f̃):    in 2004 the council is on the obligation of electronic transfer of such records.
  + Dα(ẽ):    in 2004 the council is on the obligation of electronic transmission of such records.
  + D(ẽ, f̃):  in 2004 the council is on the obligation of electronic transmission of such records.
  + ALL:      in 2004 the council is on the obligation of electronic transmission of such records.

  Input:      die angemessenheit und wirksamkeit der internen verwaltungssysteme sowie die leistung der dienststellen
  Reference:  for assessing the suitability and effectiveness of internal management systems and the performance of departments
  Baseline:   the adequacy and effectiveness of internal administrative systems as well as the performance of the services
  + Dα(f̃):    the adequacy and effectiveness of the internal management systems, as well as the performance of the services
  + Dα(ẽ):    the adequacy and effectiveness of internal management systems, as well as the performance of the services
  + D(ẽ, f̃):  the adequacy and effectiveness of the internal administrative systems as well as the performance of the services
  + ALL:      the adequacy and effectiveness of internal management systems, as well as the performance of the services

  Input:      zur ausführung der ausgaben nimmt der anweisungsbefugte mittelbindungen vor, geht rechtliche verpflichtungen ein
  Reference:  to implement expenditure, the authorising officer shall make budget commitments and legal commitments
  Baseline:   the implementation of expenditure, the authorising officer commitments before, is a legal commitments
  + Dα(f̃):    the implementation of expenditure, the authorising officer commitments, is a legal obligations
  + Dα(ẽ):    the implementation of expenditure, the authorising officer commitments before, is a legal obligations
  + D(ẽ, f̃):  the implementation of expenditure, the authorising officer commitments before, is a legal commitments
  + ALL:      the implementation of expenditure, the authorising officer commitments before, is a legal obligations

Table 7: Translation outputs produced by the basic baseline and its augmented systems with additional abstract feature functions derived from hidden domain information.

Table 6: Using different metrics as the measure of coherence.

  English-German (Task: Legal)
  Dev         Metric                         BLEU↑
  In-domain   Chebyshev                      29.1/+0.3
              Kullback-Leibler (DKL(ẽ, f̃))   29.2/+0.4
              Kullback-Leibler (DKL(f̃, ẽ))   29.0/+0.2
              Hellinger                      29.0/+0.2

Table 5 and Table 6 present the results. Overall, we can see that all domain-invariance features contribute to adaptation performance. Specifically, we observe the following:

• Favouring source-target coherence across subdomains (i.e., adding the feature D(ẽ, f̃)) provides a significant translation improvement of +0.3 BLEU. Which specific similarity measure is used does not seem to matter much (see Table 6; a small code sketch of these metrics follows this list). We obtain the best result (+0.4 BLEU) with the KL divergence (DKL(ẽ, f̃)). However, the differences are not statistically significant.

• Integrating a preference for less domain-specific translation phrases on the target side (Dα(ẽ)) leads to a translation improvement of +0.6 BLEU.

• Doing the same for the source side (Dα(f̃)), in turn, leads to an improvement of +1.0 BLEU.

• Augmenting the baseline with all our features leads to the best result, with an improvement of +1.1 BLEU.

• The translation improvement is also observed when tuning with a development set of mixed domains (even for the mixed-domain minus in-domain setting, which excludes the Legal data from the mixed development set).

• The weights for all domain-invariance features, once tuned, are positive in all the experiments.
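For reference, the three coherence metrics compared in Table 6 are one-liners over the domain vectors; a minimal sketch (the epsilon guard for zero probabilities is our addition, not part of the paper's definitions):

```python
import math

def chebyshev(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

def kl(p, q, eps=1e-12):
    """D_KL(p || q); asymmetric, hence both directions appear in Table 6."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))

def hellinger(p, q):
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q))) / math.sqrt(2)
```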
Table 7 presents examples of translations from the different systems. For example, the domain-invariant system revises the translation from "electronic transfer" to "electronic transmission" for the German phrase "elektronischen Übertragung", and from "internal administrative systems" to "internal management systems" for the German phrase "internen verwaltungssysteme". The revisions, however, are not always successful. For instance, adding Dα(ẽ) and Dα(f̃) resulted in revising the translation of the German phrase "rechtliche verpflichtungen" to "legal obligations", which is a worse choice (at least according to BLEU) than the "legal commitments" produced by the baseline.

We also present a brief analysis of the latent subdomains induced by our projection framework. For each subdomain z we integrate the domain posteriors (P(z | ẽ) and P(z | f̃)) and the source-target domain-coherence feature (|P(z | ẽ) − P(z | f̃)|). We hypothesize that whenever we observe an improvement for a translation task with such domain-informed features, the corresponding latent subdomain z is close to the target translation domain. The results are presented in Table 8.

Table 8: Latent subdomain analysis (with BLEU scores) on English-German.

  Task          Baseline  Our System
                          +z1   +z2   +z3   +z4   +z5   +z6   +z7   +z8   +z9   +z10  +z11  +z12
  Hardware      20.2      20.4  20.4  20.4  20.5  20.5  20.5  20.4  20.4  20.5  20.4  20.4  20.4
  Software      22.8      23.0  23.0  23.0  22.8  22.9  23.1  23.0  23.0  23.0  23.0  23.0  22.8
  P&B Services  13.3      13.6  13.6  13.3  13.5  13.6  13.6  13.5  13.5  13.6  13.5  13.6  13.5
  Legal         28.5      28.7  28.6  29.1  28.7  28.6  28.9  28.8  28.8  28.9  28.6  28.6  28.8

Apparently, among the latent subdomains, z4, z5, z6 and z9 are closest to the target domain of Hardware: their derived feature functions are helpful in improving the translation accuracy for the task. Similarly, z1, z2, z5, z6, z9 and z11 are closest to Professional & Business, z6 is closest to Software, and z3 is closest to Legal. Meanwhile, z4, z5 and z12 are not relevant to the task of Software. Similarly, z3 is not relevant to Professional & Business, and z2, z5 and z10 are not relevant to Legal.

Using topic models instead of latent domains. Our domain-invariance framework demands access to posterior distributions of latent domains for phrases. Though we argued for using our domain induction approach, other latent variable models can be used to compute these posteriors. One natural option is to use topic models, and more specifically LDA (Blei et al., 2003). Will our domain-invariance framework still work with topic models, and how closely related are the latent domains induced with LDA and with our model? These are the questions we study in this section.

We estimate LDA at the sentence level in a monolingual regime⁸ on one side of each parallel corpus (let us assume for now that this is the source side). When the model is estimated, we obtain the posterior distributions of topics (we denote them as z, as we treat them as domains) for each source-side sentence in the training set. Now, as we did with our phrase induction framework, we associate these posteriors with every phrase on both the source and the target side of that sentence pair. The phrase and phrase-pair features defined in Section 2 are computed relying on these probabilities averaged over the entire training set. We try both directions, that is, also estimating LDA on the target side and transferring the posterior probabilities to the source side.

⁸ Note that bilingual LDA models (e.g., see Hasler et al. (2014), Zhang et al. (2014)) could potentially produce better results, but we leave them for future work.
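A minimal sketch of this variant, using scikit-learn's variational LDA as a stand-in for the Gibbs sampler the experiments actually use (described next); the per-sentence topic vectors it returns can be fed to the same phrase-level aggregation sketched in Section 3.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def sentence_topic_posteriors(sentences, K=12):
    """Fit LDA on one side of the parallel corpus and return one
    normalized topic distribution per sentence, to be associated with
    every source and target phrase of that sentence pair."""
    counts = CountVectorizer().fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=K,
                                    doc_topic_prior=0.01,   # alpha, as in the paper
                                    topic_word_prior=0.01,  # beta, as in the paper
                                    random_state=0)
    return lda.fit_transform(counts)  # shape: (num_sentences, K)
```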
In order to estimate LDA, we used the Gibbs sampling implementation in the Mallet package (McCallum, 2002) with default values of the hyper-parameters (α = 0.01 and β = 0.01). Table 9 presents the results for the Legal task with three different system optimization settings; BLEU, METEOR and TER are reported. As the results suggest, using our induction framework tends to yield slightly better translation results in terms of METEOR and especially BLEU. However, using LDA seems to lead to slightly better translation results in terms of TER.

Table 9: Comparison of latent domain induction with various algorithms.

  English-German (Task: Legal)
  Dev                              Algorithm     BLEU↑  METEOR↑  TER↓
  In-domain                        Our           29.9   33.1     55.5
                                   LDA (source)  29.9   33.1     55.4
                                   LDA (target)  29.9   33.1     55.3
  Mixed-domains                    Our           29.8   32.9     54.9
                                   LDA (source)  29.7   32.9     54.8
                                   LDA (target)  29.7   32.9     54.8
  Mixed-domains (Exclude Legal)    Our           29.6   32.8     54.6
                                   LDA (source)  29.4   32.7     54.5
                                   LDA (target)  29.4   32.7     54.6

Topics in LDA-like models encode co-occurrence patterns in bag-of-words representations of sentences. In contrast, domains in our domain-induction framework rely on n-grams and word-alignment information. Consequently, these models are likely to encode different latent information about sentences. We also investigate translation performance when we use both the coherence features from LDA and the coherence features from our own framework. Table 10 shows that using all the induced coherence features results in the best translation, no matter which translation metric is used. We leave the exploration of such an extension for future work.

Table 10: Combination of all features.

  English-German (Task: Legal)
  Dev             Algorithm               BLEU↑  METEOR↑  TER↓
  Mixed-domains   Our features            29.8   32.9     54.9
                  LDA (source) features   29.7   32.9     54.8
                  All features            29.8   33.0     54.7

5 Related Work and Discussion

Domain adaptation is an important challenge for many NLP problems. A good survey of potential translation errors in MT adaptation can be found in Irvine et al. (2013). Lexical selection appears to be the most common source of errors in domain adaptation scenarios (Irvine et al., 2013; van der Wees et al., 2015). Other translation errors include reordering errors (Chen et al., 2013a; Zhang et al., 2015), alignment errors (Cuong and Sima’an, 2015) and overfitting to the source domain at the parameter tuning stage (Joty et al., 2015).

Adaptation in SMT can be regarded as injecting prior knowledge about the target translation task into the learning process. Various approaches have so far been exploited in the literature. They can be loosely categorized according to the type of prior knowledge exploited for adaptation. Often, a seed in-domain corpus exemplifying the target translation task is used as a form of prior knowledge. Various techniques can then be used for adaptation. For example, one approach is to combine a system trained on the in-domain data with another general-domain system trained on the rest of the data (e.g., see Koehn and Schroeder (2007), Foster et al. (2010), Bisazza et al. (2011), Sennrich (2012b), Razmara et al. (2012), Sennrich et al. (2013), Haddow (2013), Joty et al. (2015)). Rather than using the entire training data, it is also common to combine the in-domain system with a system trained on a selected subset of the data (e.g., see Axelrod et al. (2011), Koehn and Haddow (2012), Duh et al.
(2013), Kirchhoff and Bilmes (2014), Cuong and Sima’an (2014b)). In some other cases, the prior knowledge lies in meta-information about the training data. This could be document-annotated training information (Eidelman et al., 2012; Hu et al., 2014; Hasler et al., 2014; Su et al., 2015; Zhang et al., 2014), or domain-annotated sub-corpora (Chiang et al., 2011; Sennrich, 2012b; Chen et al., 2013b; Carpuat et al., 2014; Cuong and Sima’an, 2015). Some recent approaches perform adaptation by exploiting a target-domain development set, or even only the source side of the development set (Sennrich, 2012a; Carpuat et al., 2013; Carpuat et al., 2014; Mansour and Ney, 2014).

Recently, there has been some research on adapting simultaneously to multiple domains, a goal related to ours (Clark et al., 2012; Sennrich, 2012a). For instance, Clark et al. (2012) augment a phrase-based MT system with various domain indicator features to build a single system that performs well across a range of domains. Sennrich (2012a) proposed to cluster the training data in an unsupervised fashion to build mixture models that yield good performance on multiple test domains. However, their approaches are very different from ours, which minimizes the risk associated with choosing domain-specific translations. Moreover, the present work deviates radically from earlier work in that it explores the scenario where no prior data or knowledge about the translation task is available at training time. The focus of our approach is to aim for safer translation by rewarding the domain invariance of translation rules over latent subdomains, which can (still) be useful on adaptation tasks.

The present study is inspired by Zhang et al. (2014), which exploits topic insensitivity learned over documents for translation. The goal and setting we are working on are markedly different (i.e., we do not have access to meta-information about the training and translation tasks at all). The induced domain invariance is integrated into SMT systems as feature functions, redirecting the decoder to a better search space for translation over adaptation tasks. This aims at biasing the decoder towards translations that are less domain-specific and more source-target domain coherent.

There is an interesting relation between this work and extensive prior work on minimum Bayes risk (MBR) objectives (used either at test time (Kumar and Byrne, 2004) or during training (Smith and Eisner, 2006; Pauls et al., 2009)). As in our work, the goal of MBR minimization is to select translations that are less “risky”. Their risk is due to the uncertainty in model predictions, and some of this uncertainty may indeed be associated with the domain variability of translations. Still, a system trained with an MBR objective will tend to output the most frequent translation rather than the most domain-invariant one, and this, as we argued in the introduction, might not be the right decision when applying it across domains. We believe that the two classes of methods are largely complementary, and leave further investigation for future work.

At a conceptual level, our work is also related to regularizers used in learning domain-invariant neural models (Titov, 2011), specifically autoencoders. Though they also consider divergences between distributions of latent variable vectors, they use these divergences at learning time to bias models to induce representations maximally invariant across domains.
Moreover, they assume access to meta-information about domains and consider only classification problems.

6 Conclusion

This paper aims at adapting machine translation systems to all domains at once by favoring phrases that are domain-invariant, i.e., safe to use across a variety of domains. While typical domain adaptation systems expect a sample of the target domain, our approach does not require one and is directly applicable to any domain adaptation scenario. Experiments show that the proposed approach results in modest but consistent improvements in BLEU, METEOR and TER. To the best of our knowledge, our results are the first to suggest consistent and significant improvement by a fully unsupervised adaptation method across a wide variety of translation tasks.

The proposed adaptation framework is fairly simple, leaving much space for future research. One potential direction is the introduction of additional features relying on the assignment of phrases to domains. The framework for inducing latent domains proposed in this paper should be beneficial in this future work. The implementation of our subdomain-induction framework is available at https://github.com/hoangcuong2011/UDIT.

Acknowledgements

We thank the anonymous reviewers for their constructive comments on earlier versions. We also thank Hui Zhang for his help with the expected Kneser-Ney smoothing technique. The first author is supported by the EXPERT (EXPloiting Empirical appRoaches to Translation) Initial Training Network (ITN) of the European Union’s Seventh Framework Programme. The second author is supported by VICI grant nr. 277-89-002 from the Netherlands Organization for Scientific Research (NWO). We thank TAUS for providing us with suitable data.

References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of EMNLP.

Arianna Bisazza, Nick Ruiz, and Marcello Federico. 2011. Fill-up versus interpolation methods for phrase-based SMT adaptation. In Proceedings of IWSLT.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. JMLR.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics.

Marine Carpuat, Hal Daume III, Katharine Henry, Ann Irvine, Jagadeesh Jagarlamudi, and Rachel Rudinger. 2013. SenseSpotting: Never let your parallel data tie you to an old domain. In Proceedings of ACL.

Marine Carpuat, Cyril Goutte, and George Foster. 2014. Linear mixture models for robust machine translation. In Proceedings of WMT.

Boxing Chen, George Foster, and Roland Kuhn. 2013a. Adaptation of reordering models for statistical machine translation. In Proceedings of NAACL.

Boxing Chen, Roland Kuhn, and George Foster. 2013b. Vector space model for adaptation in statistical machine translation. In Proceedings of ACL.

Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of NAACL-HLT.

David Chiang, Steve DeNeefe, and Michael Pust. 2011. Two easy improvements to lexical weighting. In Proceedings of ACL (Short Papers).

Jonathan Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of ACL (Short Papers).

Jonathan Clark, Alon Lavie, and Chris Dyer. 2012.
One system, many domains: Open-domain statistical machine translation via feature augmentation.

Trevor Cohn and Gholamreza Haffari. 2013. An infinite hierarchical Bayesian model of phrasal translation. In Proceedings of ACL.

Hoang Cuong and Khalil Sima’an. 2014a. Latent domain phrase-based models for adaptation. In Proceedings of EMNLP.

Hoang Cuong and Khalil Sima’an. 2014b. Latent domain translation models in mix-of-domains haystack. In Proceedings of COLING.

Hoang Cuong and Khalil Sima’an. 2015. Latent domain word alignment for heterogeneous corpora. In Proceedings of NAACL-HLT.

Arthur Dempster, Nan Laird, and Donald Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. JRSS, Series B, 39(1):1–38.

John DeNero, Dan Gillick, James Zhang, and Dan Klein. 2006. Why generative phrase models underperform surface heuristics. In Proceedings of WMT.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of WMT.

Kevin Duh, Graham Neubig, Katsuhito Sudoh, and Hajime Tsukada. 2013. Adaptation data selection using neural language models: Experiments in machine translation. In Proceedings of ACL.

Vladimir Eidelman, Jordan Boyd-Graber, and Philip Resnik. 2012. Topic models for dynamic translation model adaptation. In Proceedings of ACL (Short Papers).

George Foster, Cyril Goutte, and Roland Kuhn. 2010. Discriminative instance weighting for domain adaptation in statistical machine translation. In Proceedings of EMNLP.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of EMNLP.

Kuzman Ganchev, Ben Taskar, Fernando Pereira, and João Gama. 2009. Posterior vs parameter sparsity in latent variable models. In Proceedings of NIPS.

Barry Haddow. 2013. Applying pairwise ranked optimisation to improve the interpolation of translation models. In Proceedings of NAACL-HLT.

Eva Hasler, Phil Blunsom, Philipp Koehn, and Barry Haddow. 2014. Dynamic topic adaptation for phrase-based MT. In Proceedings of EACL.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of ACL (Volume 2: Short Papers).

Yuening Hu, Ke Zhai, Vladimir Eidelman, and Jordan Boyd-Graber. 2014. Polylingual tree-based topic models for translation domain adaptation. In Proceedings of ACL.

Ann Irvine, John Morgan, Marine Carpuat, Hal Daume III, and Dragos Munteanu. 2013. Measuring machine translation errors in new domains. In TACL.

Mark Johnson. 2007. Why doesn’t EM find good HMM POS-taggers? In Proceedings of EMNLP-CoNLL.

Shafiq Joty, Hassan Sajjad, Nadir Durrani, Kamla Al-Mannai, Ahmed Abdelali, and Stephan Vogel. 2015. How to avoid unwanted pregnancies: Domain adaptation using neural network models. In Proceedings of EMNLP.

Katrin Kirchhoff and Jeff Bilmes. 2014. Submodularity for data selection in machine translation. In Proceedings of EMNLP.

Philipp Koehn and Barry Haddow. 2012. Towards effective use of training data in statistical machine translation. In Proceedings of WMT.

Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of WMT.

Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit.

Shankar Kumar and William J. Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of HLT-NAACL.

Saab Mansour and Hermann Ney. 2014. Unsupervised adaptation for statistical machine translation. In Proceedings of WMT.

Spyros Matsoukas, Antti-Veikko I. Rosti, and Bing Zhang. 2009. Discriminative corpus weight estimation for machine translation. In Proceedings of EMNLP.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Markos Mylonakis and Khalil Sima’an. 2008. Phrase translation probabilities with ITG priors and smoothing as learning objective. In Proceedings of EMNLP.

Graham Neubig, Taro Watanabe, Eiichiro Sumita, Shinsuke Mori, and Tatsuya Kawahara. 2011. An unsupervised model for joint phrase alignment and extraction. In Proceedings of ACL-HLT.

Franz Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, pages 19–51.

Franz Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, pages 417–449.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL.

Adam Pauls, John DeNero, and Dan Klein. 2009. Consensus training for consensus decoding in machine translation. In Proceedings of EMNLP.

Majid Razmara, George Foster, Baskaran Sankaran, and Anoop Sarkar. 2012. Mixing multiple translation models in statistical machine translation. In Proceedings of ACL.

Rico Sennrich, Holger Schwenk, and Walid Aransa. 2013. A multi-domain translation model framework for statistical machine translation. In Proceedings of ACL.

Rico Sennrich. 2012a. Mixture-modeling with unsupervised clusters for domain adaptation in statistical machine translation. In Proceedings of EAMT.

Rico Sennrich. 2012b. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of EACL.

David A. Smith and Jason Eisner. 2006. Minimum risk annealing for training log-linear models. In Proceedings of COLING/ACL.

Matthew Snover, Bonnie Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA.

Jinsong Su, Deyi Xiong, Yang Liu, Xianpei Han, Hongyu Lin, Junfeng Yao, and Min Zhang. 2015. A context-aware topic model for statistical machine translation. In Proceedings of ACL-IJCNLP.

Ivan Titov. 2011. Domain adaptation by constraining inter-domain variability of latent feature representation. In Proceedings of ACL.

Marlies van der Wees, Arianna Bisazza, Wouter Weerkamp, and Christof Monz. 2015. What’s in a domain? Analyzing genre and topic differences in SMT. In Proceedings of ACL-IJCNLP (Short Papers).

Hui Zhang and David Chiang. 2014. Kneser-Ney smoothing on expected counts. In Proceedings of ACL.
Min Zhang, Xinyan Xiao, Deyi Xiong, and Qun Liu. 2014. Topic-based dissimilarity and sensitivity models for translation rule selection. JAIR.

Biao Zhang, Jinsong Su, Deyi Xiong, Hong Duan, and Junfeng Yao. 2015. Discriminative reordering model adaptation via structural learning. In Proceedings of IJCAI.