Towards human linguistic machine translation evaluation

Marta R. Costa-jussà, Institute for Infocomm Research, 1 Fusionopolis Way, 21-01 Connexis (South Tower), Singapore 138632. E-mail: vismrc@i2r.a-star.edu.sg
Mireia Farrús, Pompeu Fabra University, C/ Tanger/Roc Boronat, 08018 Barcelona. E-mail: mfarrus@upf.edu

Abstract

When evaluating machine translation outputs, linguistics is usually taken into account implicitly. Annotators have to decide whether one sentence is better than another, using, for example, adequacy and fluency criteria or, as recently proposed, by editing the translation output so that it has the same meaning as a reference translation and is understandable. Therefore, the important linguistic fields of meaning (semantics) and grammar (syntax) are indirectly considered. In this study, we propose to go one step further towards a linguistic human evaluation. The idea is to introduce linguistics explicitly by formulating precise guidelines. These guidelines strictly mark the difference between sub-fields of linguistics such as morphology, syntax, semantics, and orthography. We show that our guidelines achieve a high inter-annotation agreement and wide error coverage. Additionally, we examine how the linguistic human evaluation data correlate among different types of machine translation systems (rule-based and statistical), and with adequacy and fluency.

Keywords: Machine translation, linguistic evaluation

1. Introduction

Evaluation in machine translation is a challenging task. As a consequence of the increased interest in enhancing machine translation systems, there is a corresponding interest in improving machine translation evaluation. Evaluating a translation output is not an easy task even for human beings, because translation involves different types of knowledge, such as linguistic and cultural knowledge, and different translators may apply different criteria. Nevertheless, human judgments of performance have been the gold standard of MT evaluation metrics.

There have been several proposals for human evaluation which have been widely used by the scientific community. Measuring adequacy and fluency was proposed by [17] and it is still a standard evaluation criterion. Adequacy is a rating of how much information is transferred between the source and the target language, and fluency is a rating of how good the target language is. The most recent human evaluation approach, which was chosen as the official machine translation evaluation metric for DARPA's Global Autonomous Language Exploitation (GALE) program [12], is HTER (Human-targeted Translation Edit Rate). HTER involves a procedure for creating targeted translations. Annotators compare the translation output against a reference translation and modify the output so that it has the same meaning as the reference and is understandable. Each inserted, deleted or modified word or punctuation mark counts as one edit, while shifting a string of any number of words, by any distance, counts as one edit [13].

Other works such as [16] propose a 5-category schema that does not use linguistic criteria. The errors are classified into five big classes: incorrect words, missing words, word order, unknown words and punctuation. Flanagan's classification [6] lists a series of errors that are language-pair dependent.
The author classifies the errors into 19 different categories for the English-to-French translation, plus three more categories to be added for the English-to-German translation. Evaluations of different MT systems over a range of linguistic checkpoints have been carried out for English-Chinese [18], and for Italian-English, German-English and Dutch-English [11].

To the best of our knowledge, the above evaluations (except for adequacy and fluency) do not report an inter-annotation agreement study. In any case, there has been no formal proposal of linguistic evaluation guidelines for machine translation. The main advantages of a linguistic evaluation would be:

• Precise linguistic guidelines that allow for a high inter-annotation agreement.
• A linguistic classification of the translation output errors.
• New information to enhance the machine translation systems.
• Evaluation without a reference.

The main drawbacks of such an evaluation are that it requires bilingual annotators and it is time-consuming. However, nowadays we can take advantage of crowd-sourcing platforms (such as Amazon's Mechanical Turk1 or Crowdflower2) to reduce these drawbacks. Crowd-sourcing enables requesters to tap into a global pool of non-experts to obtain rapid and affordable answers to simple Human Intelligence Tasks (HITs), which can subsequently be used to train data-driven applications. A number of recent papers on this subject point out that non-expert annotations, if produced in sufficient quantity, can rival and even surpass the quality of expert annotations, often at a much lower cost [14], [15]. However, this possible increase in quality depends on the task at hand and on an adequate HIT design [7], which motivates the creation of detailed guidelines.

1 https://www.mturk.com
2 http://crowdflower.com/

The rest of the paper is organized as follows. The next section briefly describes the linguistic guidelines. Section 3 reports the experimental results obtained with these guidelines. In particular, we exploit the linguistic guidelines to show correlation results at the segment level between linguistic evaluations and different types of systems. Additionally, we test the linguistic guidelines by computing the correlation with adequacy and fluency results. Finally, Section 4 discusses the most relevant conclusions.

2. Linguistic guidelines

We consider that linguistic guidelines for a machine translation system should be specific to the target language. However, they may be generalizable across different source languages. In this case, we use guidelines specific to the Catalan language. The guidelines consider four relevant linguistic evaluations: orthographic (language writing standardization); morphological (internal structures of words and how they can be modified); semantic (meaning of individual words and combinations, and how these form the meanings of sentences); and syntactic (word combination to form grammatical sentences). The guidelines should classify any error committed by a translation system into one of these categories.

The linguistic guidelines have been designed for the Catalan target language using the translation output of the Universitat Politècnica de Catalunya (UPC) statistical machine translation system [9] over a Spanish-to-Catalan test set3. The guidelines were designed by a Catalan linguist. Next, the annotation guidelines are summarized.

3 This test set consisted of 711 sentences (around 16k words) extracted from the El País and La Vanguardia newspapers [4].
• Orthographic errors include punctuation marks, erroneous accents, letter capitalization, joined words, spare blanks coming from a wrong detokenisation, apostrophes, conjunctions and errors in foreign words.

1. Punctuation marks. This includes wrong use, missing punctuation and extra punctuation (exclamation and interrogation marks, full stops, commas, colons, semicolons, dots, etc.). E.g.
Source: Es factible, pero hay que tener en cuenta tres obviedades:
Target: És factible, ara cal tenir en compte tres obvietats.

2. Accents. This includes accented vowels when not necessary, missing accents and erroneous accents. E.g.
Source: La llegada de Obama y la situación interna del régimen islamista deparan una oportunidad.
Target: *L'arribada d'Obama i la situacio interna del règim islamista ofereixen una oportunitat.
Correct target: L'arribada d'Obama i la situació interna del règim islamista ofereixen una oportunitat.

3. Capital and lower case letters. This refers to wrong capital letters within a sentence, lower case letters at the beginning of a sentence, and lower case letters in acronyms or proper nouns. E.g.
Source: El enorme peligro de este camino sería privar a un régimen aislado y teóricamente revolucionario del enemigo supremo.
Target: L'enorme perill d'aquest camí seria privar a un règim aïllat i teòricament revolucionari de l'enemic Suprem.

4. Joined words. This is a less common error in which two consecutive words are erroneously joined. E.g.
Source: Pero, aun siendo funcional para resolver y a la vista de los resultados (...)
Target: *Però, fins i tot sent funcional per resoldre ia la vista dels resultats.
Correct target: Però, fins i tot sent funcional per resoldre i a la vista dels resultats.

5. Extra spaces. This error is usually committed by not detokenising when required or detokenising in the wrong direction. E.g.
Source: "hola"
Target: " hola "

6. Apostrophe. The apostrophe is commonly used in Catalan to elide a sound. In some cases, words that should be apostrophised are not (missing apostrophe) and vice versa (extra apostrophe). E.g.
Source: Sólo hace 25 años que sabemos la historia que se oculta tras esa imagen turbadora
Target: *Només fa 25 anys que sabem la història que se amaga darrere aquesta imatge torbadora.
Correct target: Només fa 25 anys que sabem la història que s'amaga darrere aquesta imatge torbadora.

• Morphological errors include lack of gender and number concordance, apocopes, errors in verbal morphology (inflection) and lexical morphology (derivation and compounding), and morphosyntactic changes due to changes in syntactic structures.

1. Lack of gender concordance. Some words are given a different gender in different languages. For instance, the word smile is feminine in Spanish (la sonrisa) and masculine in Catalan (el somriure). It is then common to find a lack of gender concordance in articles and adjectives with a noun that changes its gender from one language to the other, especially in statistical systems, where there are no rules to solve it. E.g.
Source: El balón llegó tarde
Target: *El pilota va arribar tard.
Correct target: La pilota va arribar tard.

2. Lack of number concordance. Although it is less common, some words are given a different number in different languages. For instance, the word money is singular in Spanish (el dinero) and plural in Catalan (els diners).
As in the previous case, this causes a lack of number concordance in articles and adjectives with the following noun, again especially in statistical systems, where there are no rules to solve it. E.g.
Source: El gobierno se ha gastado todo el dinero de los ciudadanos.
Target: *El govern s'ha gastat tot el diners dels ciutadans.
Correct target: El govern s'ha gastat tots els diners dels ciutadans.

3. Verbal morphology. This error refers to a verb that is not correctly inflected, a common error in a highly inflected language such as Catalan. The most common cases are the translation of an inflected verb into the infinitive form, or the lack of person concordance. E.g.
Source: El mismo que usted puede ahora constatar en la exposición Bacon.
Target: *El mateix que vostè pugues ara constatar en l'exposició Bacon.
Correct target: El mateix que vostè pot ara constatar en l'exposició Bacon.

4. Lexical morphology. This basically concerns word formation: derivation and compounding, such as the use of a derivative in a wrong way (e.g. lliguer instead of de la Lliga) or a wrong compound (e.g. històric-social instead of historicosocial).

• Semantic errors include no correspondence between source and target words, non-translated but necessary source words, missing target words, and proper nouns that are not translated or are translated when not necessary. Additionally, they include polysemy, homonymy, and expressions used in a different way in the source and target languages.

1. Polysemy. A polyseme is a word with multiple meanings that share the same origin. A polysemy error occurs when the incorrect meaning is chosen in the target language. E.g. the Catalan conjunction perquè has two different meanings: porque (because) and para que (in order to), which usually causes translation errors.

2. Homonymy. Homonymy is found when two or more words share the same spelling and the same pronunciation but have different meanings, usually as a result of having different origins. Like polysemy errors, homonymy errors occur when the incorrect meaning is chosen in the target language. E.g. the Spanish adverb solo, which can also be an adjective, is translated by the Catalan adjective sol instead of the corresponding adverb només.

3. Incorrect word. This error is detected when there is no correspondence at all between the source word and the translated target word. It is normally found in statistical systems, where the word is translated incorrectly mainly due to alignment errors. E.g.
Source: No llegaron hasta las cuatro de la tarde.
Target: *No van arribar fins a les quatre.
Correct target: No van arribar fins a les quatre de la tarda.

4. Unknown word. This refers to a non-translated source word, which is left intact in the target side. E.g.
Source: el caso Dutroux, que convulsionó a Bélgica a principios de los noventa,
Target: *el cas Dutroux, que convulsionó Bèlgica a principis dels noranta
Correct target: el cas Dutroux, que va convulsionar Bèlgica a principis dels noranta

5. Missing target word. This refers to a source word whose translation is missing in the target side. E.g.
Source: Fue una decisión impopular, pero seguramente justa.
Target: Va ser una decisió impopular, segurament justa.

6. Proper nouns. This error concerns non-translated proper nouns (i.e. unknown proper nouns) or proper nouns translated when not necessary (for instance, not being detected as a proper noun but translated as a common noun). E.g.
Source: Zapatero se negó.
Target: *El sabater s'hi va negar.
Correct target: Zapatero s'hi va negar.
• Syntactic errors include errors in prepositions, errors in relative clauses, verbal periphrasis, clitics, missing or spare articles in front of proper nouns, and syntactic element reordering.

1. Prepositions. This error refers to prepositions not elided in the target language (extra prepositions), prepositions not inserted in the target language (missing prepositions), or source prepositions maintained in the target language instead of the correct target preposition (incorrect prepositions). E.g.
Source: Debería ser recusado en favor de otro juez.
Target: *Hauria de ser recusat en favor d'un altre jutge.
Correct target: Hauria de ser recusat a favor d'un altre jutge.

2. Relative pronouns. Due to their syntactic complexity, relative clauses involving relative pronouns referring to previous elements usually lead to erroneous translations. E.g.
Source: Murieron tres personas al colisionar un Ford Scort con un Renault Scenic cuyo conductor sufrió heridas leves.
Target: *Hi van morir tres persones al topar un Ford Scort amb un Renault Scénic amb un conductor va patir ferides lleus.
Correct target: Hi van morir tres persones en topar un Ford Scort amb un Renault Scénic el conductor del qual va patir ferides lleus.

3. Verbal periphrasis. The use of verbal periphrases, especially when they involve prepositions that differ between the two languages, also usually leads to translation errors (e.g. the Spanish verbal periphrasis tener que (have to) is usually translated literally into Catalan as tenir que instead of the correct periphrasis haver de).

4. Clitics. This includes an incorrect syntactic function of the pronoun or a wrong clitic-verb combination. E.g.
Source: El niño se cayó por las escaleras de su casa.
Target: *El nen es va caure per les escales de casa seva.
Correct target: El nen va caure per les escales de casa seva.

5. Articles. This error refers to missing or extra articles in front of proper nouns. E.g.
Source: Rosa entró en el despacho del dueño
Target: *Rosa va entrar al despatx del propietari
Correct target: La Rosa va entrar al despatx del propietari

6. Reordering. This refers to errors in the syntactic reordering of the elements of the sentence.

A list of the linguistic errors can be found in Table 1.

Orthographic: Punctuation marks; Accents; Capital and lower case letters; Joined words; Extra spaces; Apostrophe
Morphological: Gender concordance; Number concordance; Verbal morphology; Lexical morphology
Semantic: Polysemy; Homonymy; Incorrect words; Unknown words; Missing target word; Proper nouns
Syntactic: Prepositions; Relative pronouns; Verbal periphrasis; Clitics; Articles; Reordering

Table 1: Guidelines summary.
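To illustrate how annotations based on this taxonomy can be turned into the per-level counts used later in the evaluation (see Table 4), the following hypothetical Python sketch tallies error records by linguistic level. It is not part of the paper's evaluation tooling; the record format, names and example annotations are assumptions made purely for illustration.

```python
from collections import Counter

# The four linguistic levels and their sub-categories, following Table 1.
TAXONOMY = {
    "orthographic": ["punctuation marks", "accents", "capital and lower case letters",
                     "joined words", "extra spaces", "apostrophe"],
    "morphological": ["gender concordance", "number concordance",
                      "verbal morphology", "lexical morphology"],
    "semantic": ["polysemy", "homonymy", "incorrect words", "unknown words",
                 "missing target word", "proper nouns"],
    "syntactic": ["prepositions", "relative pronouns", "verbal periphrasis",
                  "clitics", "articles", "reordering"],
}

# Map every sub-category back to its linguistic level.
SUBTYPE_TO_LEVEL = {sub: level for level, subs in TAXONOMY.items() for sub in subs}

def tally(annotations):
    """Count errors per linguistic level from (segment_id, sub_category) records."""
    per_level = Counter(SUBTYPE_TO_LEVEL[sub] for _, sub in annotations)
    segments_with_errors = len({seg for seg, _ in annotations})
    return per_level, segments_with_errors

# Hypothetical annotations for three segments of one system's output.
example = [(1, "accents"), (1, "gender concordance"), (2, "incorrect words"),
           (3, "prepositions"), (3, "polysemy")]
print(tally(example))  # (Counter({'semantic': 2, ...}), 3)
```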
3. Experiments

This section describes the experiments that were designed to evaluate the performance of the linguistic guidelines briefly reported in the previous section. First, we wanted to evaluate the inter-annotation agreement. Second, we wanted to test the coverage of the linguistic errors and the generalization to a different source language. Finally, we compute the correlation of the linguistic evaluations among different translation systems, and with standard human evaluation methods such as adequacy and fluency.

3.1. Data set

The test corpus falls within the medicine domain. This medical corpus was kindly provided by the UniversalDoctor project, which focuses on facilitating communication between health-care providers and patients from various origins4. Table 2 summarizes the number of sentences, words and vocabulary of the medical corpus.

             English
Sentences    630
Words        4073
Vocabulary   1050

Table 2: Corpus statistics of the English medical test set.

4 http://www.universaldoctor.com

3.2. Machine translation systems

As translation systems, we used four freely available systems on the web. They include two rule-based MT (RBMT) systems, Apertium and Translendium, and two statistical MT (SMT) systems, Google Translate and UPC. All systems were used in their versions as of 1 February 2010.

• The Apertium platform5 is an open-source RBMT system originally based on existing translation systems designed by the Transducens group at the Universitat d'Alacant (UA). The system uses shallow-transfer machine translation technology.

• Translendium6 is developed by Translendium S.L., a Catalan company located in Barcelona and a subsidiary of the European group Lucy Software, made up of linguists and computer scientists with more than fifteen years of experience in the machine translation field. The translation engine consists of a modular structure of computational grammars and lexicons that makes it possible to carry out a morphosyntactic analysis of the source text and then transfer it into the target language.

• Google Translate7 is an SMT system developed by Google's research group for more than 50 languages. The system is trained on billions of words of text, including monolingual text in the target language and parallel text aligned between the two languages. Google is constantly working to support more languages and introduces them as soon as the automatic translation meets their standards.

• The UPC system8 is developed at the Universitat Politècnica de Catalunya. Based on an N-gram translation model integrated into an optimized log-linear combination of additional features, it is mainly a statistical system, although it also includes additional linguistic rules to solve some errors caused by the statistical translation [4].

5 http://www.apertium.org/
6 http://www.translendium.com
7 http://translate.google.com/
8 http://www.n-ii.org/

3.3. Inter-annotation agreement in adequacy and fluency human evaluation

The evaluation in adequacy and fluency was performed by three annotators, native Catalan speakers fluent in English. Adequacy and fluency were rated on a scale from 1 (bad) to 5 (good). All annotators evaluated 2520 (630 × 4) sentences both in adequacy and fluency. The inter-annotation agreement was evaluated with the weighted kappa [2] using a quadratic distance between errors. The weighted kappa was 0.62, which is qualified as 'good' according to [8].

3.4. Inter-annotation agreement in the linguistic human evaluation

The linguistic evaluation was performed by three annotators, native Catalan speakers fluent in English. The errors are reported according to the following linguistic evaluations: orthographic, morphological, semantic and syntactic, as described in Section 2. Annotators were not able to find a single error that was not covered by the guidelines. This was one of the main objectives of the guidelines, and it is a notable result because the guidelines were designed on a different data set with a different source language. This means that guidelines designed for a particular target language may be used for different language pairs sharing that target language.

We evaluated the inter-annotation agreement with the weighted kappa ($\kappa$) [2] using a linear unit distance between errors:

\[
\kappa = 1 - \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, x_{ij}}{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, m_{ij}}
\]

where k is the number of codes (in our case, four categories) and $w_{ij}$, $x_{ij}$ and $m_{ij}$ are elements in the weight, observed and expected matrices, respectively.
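As an illustration of the formula above, the following minimal Python sketch computes weighted kappa from a square matrix of observed annotator co-assignments, using either the linear distance weights applied to the linguistic levels or the quadratic weights applied to adequacy and fluency. The function name and the toy counts are assumptions for illustration only, not the matrices used in the paper.

```python
import numpy as np

def weighted_kappa(observed, weights="linear"):
    """Cohen's weighted kappa for a k x k matrix of observed co-assignments.

    observed[i][j] counts items labelled category i by one annotator and
    category j by the other; weights encode the distance between categories."""
    x = np.asarray(observed, dtype=float)
    x = x / x.sum()                                  # observed proportions
    k = x.shape[0]
    m = np.outer(x.sum(axis=1), x.sum(axis=0))       # expected proportions under chance
    d = np.abs(np.arange(k)[:, None] - np.arange(k)[None, :])
    w = d if weights == "linear" else d ** 2         # disagreement weights
    return 1.0 - (w * x).sum() / (w * m).sum()

# Toy 4x4 co-assignment counts for the four linguistic levels (invented numbers).
obs = [[30, 2, 1, 0],
       [3, 25, 2, 1],
       [1, 2, 40, 4],
       [0, 1, 5, 20]]
print(weighted_kappa(obs, weights="linear"))
```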
The weighted kappa was 0.75, which is good according to [8]. This kappa is quite high compared to other inter-annotation kappas in MT evaluation [1], which we attribute to the precise design of the linguistic guidelines.

To sum up, we boost kappa by giving strict guidelines, which differs from relying on the holistic evaluation provided by the adequacy and fluency criteria. Depending on the application, one evaluation or the other may be preferred.

3.5. Adequacy and fluency results

Table 3 shows the results of the translation evaluation for the 4 different system outputs. Notice that Google is ranked the best system in adequacy and Translendium is ranked the best system in fluency.

English-to-Catalan   Adequacy   Fluency
Apertium             2.9        2.5
Google               4.1        3.8
Translendium         3.9        4.0
UPC                  2.8        2.0

Table 3: Adequacy and fluency results for the English-to-Catalan translation outputs.

3.6. Linguistic human evaluation results

Table 4 shows the results of the translation evaluation for the 4 different system outputs. Notice that semantic errors are the most common and orthographic errors are the least common. If we rank systems by orthography, Apertium is the best system. If we rank systems by morphology or syntax, Translendium is the best one. And if we rank systems by semantics, Google is the best one. Therefore, this evaluation may be useful for deciding which system is better for a specific application. For example, if one requires tourist information, one may only be interested in the meaning of the translation; in that case, one may choose Google, which has the lowest number of semantic errors.

English-to-Catalan   Sent. w/ errors   Total errors   Ort.   Mor.   Sem.   Syn.
Apertium             464               731            10     79     463    179
Google               305               492            27     72     232    161
Translendium         324               478            31     30     293    124
UPC                  519               1168           33     139    715    281

Table 4: Linguistic evaluation results for the English-to-Catalan translation outputs: number and type of linguistic errors.

Previous experiments with these guidelines can be found in [4], [3] and [5].

3.7. Correlation between linguistic evaluations and adequacy and fluency

We performed the correlation at the segment level between the linguistic judgments and the adequacy and fluency criteria. The correlation at the segment level was computed using Kendall's $\tau_B$ rank correlation among the different linguistic evaluations and systems. Let $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$ be a set of joint observations from two random variables X and Y, respectively (e.g. orthography and semantics). Any pair of observations $(x_i,y_i)$ and $(x_j,y_j)$ is said to be concordant if the ranks for both elements agree: that is, if both $x_i > x_j$ and $y_i > y_j$, or if both $x_i < x_j$ and $y_i < y_j$. They are said to be discordant if $x_i > x_j$ and $y_i < y_j$, or if $x_i < x_j$ and $y_i > y_j$. If $x_i = x_j$ or $y_i = y_j$, the pair is neither concordant nor discordant. Given that we presumably have many ties, we use the Kendall $\tau_B$ coefficient, which makes adjustments for ties and is defined as

\[
\tau_B = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}
\]

where $n_c$ is the number of concordant pairs and $n_d$ the number of discordant pairs; a concordant pair is a pair of two translations of the same segment in which the ranks given by the number of errors calculated from the corresponding linguistic level agree, and in a discordant pair they disagree. Ties are adjusted as shown in the denominator:

\[
n_0 = \frac{n(n-1)}{2}, \qquad n_1 = \sum_i \frac{t_i(t_i-1)}{2}, \qquad n_2 = \sum_j \frac{u_j(u_j-1)}{2}
\]

where n is the total number of observations, $t_i$ is the number of tied values in the i-th group of ties for the first quantity, and $u_j$ is the number of tied values in the j-th group of ties for the second quantity. The possible values of $\tau_B$ range between 1 (where all pairs are concordant) and −1 (where all pairs are discordant). Thus, the higher the value of $\tau_B$, the more similar the linguistic evaluations; when $\tau_B$ is zero, the linguistic evaluations are independent.

When correlating with the human judgments, a concordant pair is a pair of two translations of the same segment in which the ranks calculated from the human ranking task (adequacy or fluency) and from the number of linguistic errors of the corresponding level agree; in a discordant pair, they disagree. The higher the value of $\tau_B$, the more similar the linguistic level ranking is to the human ranking in either adequacy or fluency. In all cases, correlations followed a statistically significant trend [10].
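The segment-level correlations can be reproduced with any implementation of the tie-adjusted $\tau_B$; the sketch below uses scipy.stats.kendalltau, which computes the $\tau_B$ variant by default. The per-segment scores are invented for illustration and are not the data behind Table 5.

```python
from scipy.stats import kendalltau

# Hypothetical per-segment values for the same eight segments: the number of
# semantic errors found (lower is better) and an adequacy rating (higher is better).
semantic_errors = [0, 2, 1, 0, 3, 1, 2, 0]
adequacy = [5, 2, 4, 5, 1, 3, 2, 4]

# Negate the error counts so that both quantities rank "better" segments higher;
# kendalltau then returns the tie-adjusted tau-B and a p-value.
tau, p_value = kendalltau([-e for e in semantic_errors], adequacy)
print(tau, p_value)
```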
Table 5 shows the results for 2520 (630 sentences × 4 systems) segments. On the one hand, adequacy is clearly correlated with semantics, slightly with syntax, and not at all with orthography or morphology, since these two levels do not interfere with the understanding of the translation. On the other hand, fluency is correlated with all levels, in decreasing order of importance: semantics, syntax, orthography and morphology. In both cases, adequacy and fluency are clearly related to the total number of errors produced by the system.

                 Adequacy   Fluency
Orthographic     0          .13
Morphological    0          .11
Semantic         .57        .54
Syntactic        .17        .25
Total errors     .58        .57

Table 5: Correlation at the segment level between the linguistic evaluations and adequacy and fluency.

4. Conclusions

We proposed an alternative way of performing human evaluation in machine translation. To the best of our knowledge, our proposal is the first linguistic evaluation that has been tested in detail, providing good inter-annotation agreement, excellent error coverage and informative segment-level correlation with the standard human evaluation methodology of adequacy and fluency. In this sense, linguistic guidelines have been shown to be useful for machine translation evaluation.

This methodology has been shown to achieve a very high inter-annotation agreement (a kappa of 0.75), which should be one of the main goals in machine translation evaluation. The level of agreement achieved is quite surprising, especially if we take into account that the evaluation does not use a reference translation.

Moreover, the linguistic guidelines, designed for Spanish-to-Catalan and specific to the target language (Catalan), have been shown to generalize to a different source language (English). Annotators could not find a single error that was not specified in the guidelines. Finally, the linguistic classification of errors provides new information, which has proven useful for relating linguistic errors across different types of systems. Additionally, we have shown that annotators take semantic and syntactic errors into account when evaluating adequacy, and take all types of errors into account, to some extent, when evaluating fluency. Our intention with this correlation analysis was not to reach especially high correlations, but to show how the linguistic evaluations are related when studying translation outputs.
In future work, we would like to investigate how these linguistic guidelines work on a crowd-sourcing platform and how this new linguistic information can be used to improve machine translation systems.

5. Acknowledgments

The authors would like to thank the Institute for Infocomm Research for their support and permission to publish this research. This work has been partially funded by the Seventh Framework Program of the European Commission through the International Outgoing Fellowship Marie Curie Action (IMTraP-2011-29951).

References

[1] Callison-Burch, C., Koehn, P., Monz, C., Peterson, K., Przybocki, M., Zaidan, O.. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In: Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR. Uppsala, Sweden: Association for Computational Linguistics; 2010. p. 17–53.

[2] Cohen, J.. Weighted kappa: Nominal scale agreement and the maximum value of kappa. Educational and Psychological Measurement 1968;(49):835–850.

[3] Costa-jussà, M., Farrús, M., Mariño, J., Fonollosa, J.. Study and comparison of rule-based and statistical Catalan-Spanish machine translation systems. Accepted in Computing and Informatics journal; 2011.

[4] Farrús, M., Costa-jussà, M., Poch, M., Hernández, A., Mariño, J.. Improving a Catalan-Spanish statistical translation system using morphosyntactic knowledge. In: 13th Annual Meeting of the EAMT: European Association for Machine Translation. Barcelona; 2009. p. 52–57.

[5] Farrús, M., Costa-jussà, M.R., Popovic, M.. Study and correlation analysis of linguistic, perceptual, and automatic machine translation evaluations. JASIST 2012;63(1):174–184.

[6] Flanagan, M.A.. Error classification for MT evaluation. In: Proc. of the AMTA. Columbia; 1994. p. 65–72.

[7] Kittur, A., Chi, E.H., Suh, B.. Crowdsourcing user studies with Mechanical Turk. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. New York, NY, USA: ACM; CHI '08; 2008. p. 453–456.

[8] Landis, J.R., Koch, G.G.. The measurement of observer agreement for categorical data. Biometrics 1977;33:157–174.

[9] Mariño, J., Banchs, R., Crego, J., de Gispert, A., Lambert, P., Fonollosa, J., Costa-jussà, M.. N-gram based machine translation. Computational Linguistics 2006;32(4):527–549.

[10] McBride, G.. Anomalies and remedies in non-parametric seasonal trend tests and estimates. National Institute of Water and Atmospheric Research, Hamilton; 2000.

[11] Naskar, S.K., Toral, A., Gaspari, F., Way, A.. A framework for diagnostic evaluation of MT based on linguistic checkpoints. In: Proceedings of the 13th Machine Translation Summit. Xiamen, China; 2011. p. 529–536.

[12] Olive, J.. Global autonomous language exploitation. DARPA/IPTO Proposer Information Pamphlet; 2005.

[13] Snover, M., Madnani, N., Dorr, B.J., Schwartz, R.. TER-Plus: paraphrase, semantic, and alignment enhancements to translation edit rate. Machine Translation 2010;23(2-3):117–127.

[14] Snow, R., O'Connor, B., Jurafsky, D., Ng, A.Y.. Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2008. p. 254–263.

[15] Su, Q., Pavlov, D., Chow, J.H., Baker, W.C.. Internet-scale collection of human-reviewed data. In: Proceedings of the 16th International Conference on World Wide Web. 2007. p. 231–240.
[16] Vilar, D., Xu, J., D'Haro, L.F., Ney, H.. Error analysis of statistical machine translation output. In: Proc. of LREC. Genoa, Italy; 2006.

[17] White, J., O'Connell, T., O'Mara, F.. The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In: Proc. of the 1st Conference of the Association for Machine Translation in the Americas. Columbia; 1994. p. 193–205.

[18] Zhou, M., Wang, B., Liu, S., Li, M., Zhang, D., Zhao, T.. Diagnostic evaluation of machine translation systems using automatically constructed linguistic check-points. In: Proceedings of the 22nd International Conference on Computational Linguistics (COLING '08). Stroudsburg, PA, USA; volume 1; 2008. p. 1112–1128.