key: cord-0326021-k3g3p6ib
authors: Cao, Jialun; Li, Meiziniu; Li, Yeting; Wen, Ming; Cheung, Shing-Chi
title: SemMT: A Semantic-based Testing Approach for Machine Translation Systems
date: 2020-12-03
journal: nan
DOI: 10.1145/3490488
sha: 5ecc99f0ca4a3fc88f9ef49f1aff7478327817cc
doc_id: 326021
cord_uid: k3g3p6ib

Machine translation has wide applications in daily life. In mission-critical applications such as translating official documents, incorrect translation can have unpleasant or sometimes catastrophic consequences. This motivates recent research on testing methodologies for machine translation systems. Existing methodologies mostly rely on metamorphic relations designed at the textual level (e.g., Levenshtein distance) or syntactic level (e.g., the distance between grammar structures) to determine the correctness of translation results. However, these metamorphic relations do not consider whether the original and translated sentences have the same meaning (i.e., semantic similarity). Therefore, in this paper, we propose SemMT, an automatic testing approach for machine translation systems based on semantic similarity checking. SemMT applies round-trip translation and measures the semantic similarity between the original and translated sentences. Our insight is that the semantics expressed by the logical and numeric constraints in sentences can be captured using regular expressions (or deterministic finite automata), for which efficient equivalence/similarity checking algorithms are available. Leveraging this insight, we propose three semantic similarity metrics and implement them in SemMT. The experiment results reveal that SemMT can achieve higher effectiveness compared with state-of-the-art works, achieving an increase of 21% and 23% in accuracy and F-Score, respectively. We also explore potential improvements that can be achieved when proper combinations of metrics are adopted. Finally, we discuss a solution to locate the suspicious trip in round-trip translation, which may shed light on further exploration.

Machine translation systems, which provide automatic translation for text and speech from a source language to another target language, are widely used in daily life [38, 39]. However, machine translation systems can give incorrect or inappropriate translations, which could lead to harmful consequences such as embarrassment, misunderstanding, financial loss, or political conflicts [26, 55, 61, 62]. This motivates the research on testing methodologies to assure the quality of machine translation software. Recent works [36, 39, 66, 83, 90, 99] on machine translation testing mostly adopt the metamorphic testing approach [19, 20]. The intuition is that similar sentences should be similarly translated [39, 83, 99], while sentences with different meanings should not have the same translations [36]. Mistranslations are detected by examining the textual (e.g., Levenshtein distance) or syntactic similarities between the original sentence and the translated sentence. However, a close textual or syntactic distance between two sentences does not necessarily imply close semantic meaning, and thus cannot guarantee the correctness of translation. For example, in Fig. 1 (Example 1), the verbs "include" and "exclude" are logically opposite, yet the textual/syntactic distance between the two sentences is small. Consequently, although the two sentences are textually and syntactically close, they deliver opposite semantic meanings.
Using semantic-based metrics such as the state-of-the-art SBERT [73], the similarity measured over the two sentences in Example 1 is low (i.e., only 40%), which indicates that the difference can be captured more precisely in this case. However, SBERT does not necessarily perform well in all cases. In Example 2 of Fig. 1, SBERT gives a high semantic similarity (93%) although the phrases "at least" and "at most" in E3 and E4 are two opposite quantifiers. The two examples suggest that a better metric to measure sentence semantics is needed.

However, the semantic meaning of sentences is hard to capture precisely, which exacerbates the challenge of measuring translation correctness. It is non-trivial to judge the semantic equivalence/similarity between two sentences even for humans [98]. The judgement made by humans can be subjective [66, 99], and decisions may vary across individuals. Besides, the flexibility of natural language makes this problem even more challenging [99]. For example, a token or phrase may have multiple correct translations; in such cases, modern translation software does not perform well [39]. In view of these challenges, we approach the problem by confining the scope to the semantics that concern quantifiers and logical relations (like the examples in Fig. 1), and then test translation on sentences with such semantics. It is worth mentioning that sentences with quantifiers [68, 81] and logical relations [21, 37, 82] are pervasive in daily life and of central importance in linguistic semantics [9, 67, 68, 84]. According to our investigation (see Section 2), one in every six sentences involves quantifiers or logical relations, reflecting the prevalent use of such sentences in daily life. Since such sentences mainly specify the quantity and logical meaning of objects, and are commonly found in legal contracts, financial statements, healthcare reports, product instructions and so on, mistranslating them can cause misunderstanding or severe consequences.

For example, in Fig. 2, the quantifier "30 times less than" in the first sentence (i.e., Original 1) is mistranslated to "30 times" by the Google translator (i.e., Translation 1). The translation mistakenly reports children's morbidity and mortality, which can result in unnecessary public panic. Note that the original sentence is taken from a policy brief released recently [71] on the impact of COVID-19 on children. Similarly, in the second example, excerpted from a report on Diseases and Mortality [25], the number of children who suffered from disease is exaggerated after translation. Such mistranslations can cause severe consequences for the public. Apart from mistranslations of quantifiers, mistranslations of logical relations are also common. The third example (i.e., Original 3), taken from [54], in Fig. 2 illustrates the mistranslation of a logical relation, conveying opposite meanings (i.e., exclusion and inclusion) before and after translation and resulting in a wrong understanding of the household situation in the northwest. Earlier work [36] has proposed to detect mistranslations based on the intuition that semantically different sentences should not have the same translation results. However, such a detection technique may not be effective because many mistranslated sentences and their semantically mutated counterparts do have different translation results. For example, the sentence given by Original 3 in Fig. 2 is mistranslated to Chinese by Google.
Yet, Google gives a different Chinese translation when the word "not" is removed from the sentence, thereby escaping detection based on identical translation results. The same situation occurs in the translation of the other two sentences in Fig. 2 (i.e., replacing "less" with "more" in Original 1, and the reverse replacement in Original 2). As such, it is important to design an effective testing methodology for the translation of sentences that contain quantifiers and logical relations. Such testing methodologies have not been studied in prior work.

In this paper, we propose SemMT, an automatic testing approach for machine translation systems based on semantic similarity checking of the concerned quantifiers and logical relations. The insight of SemMT is that the semantics concerning logical relations and quantifiers in sentences can be captured by regular expressions (or deterministic finite automata), for which efficient equivalence/similarity checking algorithms are available. To be more specific, SemMT addresses the difficulty of capturing semantic similarities precisely and detecting semantic mistranslations in translation systems using the following three strategies:

Transformation to regular expressions. Since the semantics regarding quantifiers and logical relations in sentences can be captured using regular expressions, the core step of SemMT is to transform the sentences into semantically equivalent regular expressions. This strategy enables semantic similarity to be captured as precisely as possible. As a well-explored, well-tested and widely-used formalism whose syntax is defined by a context-free grammar [15, 27, 87, 89, 97], the semantics of a regular expression (abbrev. regex) can be evaluated under a context-free paradigm, enabling us to capture and quantify semantic similarities precisely.

Precise semantic similarity capturing. Based on the above strategy, the semantic similarities over regular expressions can then be captured and quantified via well-established algorithms in formal language theory [41, 64, 96]. If the semantic similarity between the original and translated sentences, as quantified by these algorithms, falls below the similarity threshold, the translation is considered a suspicious mistranslation.

Semantic checking in the same language. Semantic equivalence can hardly be measured automatically across different languages [93]. Therefore, SemMT performs testing on round-trip translation, which translates an original sentence to another language and then translates it back. In this way, the back-translated sentence and the original sentence are in the same language, allowing their semantics to be uniformly measured and compared.

[Fig. 2 (excerpt). Original 1 (quantity-related): "The share of symptomatic children who lose their lives to the virus in China has been estimated as 1 in 25,000, which is 30 times less than of the middle aged and 3,000 times less than the elderly." Translation 1: 据估计,在中国,因病毒而丧生的有症状儿童的比例为25,000人中的1人,这是中年人的30倍,而中年人的3,000倍。 (Meaning: It is estimated that the proportion of symptomatic children killed by the virus in China is 1 in 25,000, which is 30 times that of middle-aged people and 3,000 times that of middle-aged people.) Original 2: "Diarrheal disease was the cause of every tenth child's death in 2017 - more than half a million of the 5.4 million children that died in 2017 died from diarrheal disease."]

On top of that, we implemented SemMT and compared it with the state-of-the-art testing techniques. The experiment results show that SemMT achieves an increase of 23% in terms of F-Score as compared with other similarity measures.
SemMT outperforms the state-of-the-art techniques, achieving an improvement of 34.2% in accuracy with a comparable number of mistranslations detected. Besides, SemMT can detect 213 bugs in the Google translator, as compared with 173 detected by other similarity metrics. We also study the possibility of improving accuracy and F-Score by combining different similarity metrics. Note that a mistranslation detected by SemMT may occur in the forward translation or the backward translation [66, 80, 99]. We discuss a method to locate the translation in which the bug resides. In addition, we discuss the types of bugs detected by SemMT. To sum up, this paper addresses the oracle problem in testing machine translation systems with respect to their semantics. Specifically, it makes the following four main contributions:

• We propose SemMT, a semantic-based machine translation testing framework. Specifically, it captures the semantics of quantifiers and logical relations during translation by semantically equivalent regular languages and detects mistranslations accordingly. To the best of our knowledge, it is the first testing methodology proposed for machine translation systems based on semantic similarity.
• We introduce an approach to determining and quantifying semantic differences. Specifically, we transform natural language into semantically equivalent regular expressions, and then propose metrics to measure semantic similarities in a formal way.
• The experiment results show that our proposed metrics are more effective than existing similarity metrics in capturing the semantics of sentences that concern quantifiers and logical relations. Using the proposed

Measuring the semantic difference/similarity of natural languages is still an open problem due to their subjective [66, 99], flexible [99] and context-aware [50, 88] nature. We therefore confine our measurement to the semantics of quantifiers and logical relations, which can be represented formally by regular expressions. A quantifier is a word/phrase that usually goes before a noun to express the quantity of the object. It is traditionally defined using set-theoretic terms in linguistic theories [8, 47, 52, 68, 81]. Commonly used quantifiers include proportional quantifiers (e.g., "some", "a few", "many" and "more than half"), logic quantifiers (e.g., "none"), and quantifiable quantifiers (e.g., "more than half" and "more than 3 times"). We refer to sentences that contain quantifiers as quantifier-related sentences. For logical relations, we follow existing works [5, 58, 59, 65, 70, 76] and focus on semantics expressed in first-order logic such as conjunction, disjunction, negation, and inclusive and exclusive relations (e.g., "X contains Y"). We refer to sentences that contain such logical semantics as logic-related sentences.

Since we confine our scope to quantifier- and logic-related sentences, a legitimate question is: Are quantifier- and logic-related sentences common? To answer this question, we collected five large-scale corpora (i.e., Europarl [44], CommonCrawl and News available from the Workshop on Machine Translation (WMT) [1], the News Commentary Parallel Corpus [85], and Financial News from Reuters [29]) which are commonly used for machine translation and other natural language processing tasks, and analyzed the proportion of sentences that involve quantifiers or logical relations.
The corpora that we analyzed cover a broad range from policy and finance to daily news, which indicates that the findings made based on them can be generalized. We followed the methodology of earlier work [3, 68] to identify quantifier-related sentences and logic-related sentences. The list of patterns that we used for the analysis is publicly available [2]. Table 1 shows the statistics of the investigation. Initially, over 4 million sentences were collected. After filtering out those containing fewer than 10 words (e.g., the sentence "I understand"), the analysis was conducted over 3.7 million sentences. According to Table 1, there are 639,179 (16.96%) sentences that are quantifier- or logic-related. This reflects the popular use of such sentences in daily life. For a further breakdown, 12.52% of the sentences are quantifier-related and 5.22% are logic-related, respectively. While not all of these quantifiers and logical relations can be precisely captured by regexes (e.g., "a few" and "many" cannot be quantified by an exact number), we further measured the proportion of quantifiers and logical relations that can be precisely captured by regexes. As revealed by Table 1, 89.6% (= 15.19/16.96) of the quantifier- and logic-related sentences in the selected datasets can be captured precisely. A closer examination reveals that 90.8% of the quantifier- and 83.9% of the logic-related sentences can be precisely captured. Such high ratios reflect the fact that quantifier- or logic-related sentences that can be captured precisely are also pervasive, and focusing on such sentences will not largely narrow the scope of application of our approach to detecting mistranslation bugs. Besides, we also discuss the situation where the semantics can only be captured approximately in §5.3.

In this section, we first give an overview of SemMT, followed by an explanation of its methodology. (Fig. 3 illustrates the workflow of SemMT with two illustrative examples: "A string that contains only 3 or more uppercase letters." / "A string containing only no more than 3 uppercase letters.", and "[X] containing only" / "[X] not containing only".) The middle and lower diagrams demonstrate two mistranslations that can be detected by SemMT. The first original sentence is taken from the NL-RX-Synth dataset [53], and the second one is taken from an online housing needs document [54] (the same as the third example in Fig. 2). Given the original sentences, the round-trip translation is first conducted: a forward translation from the source language (English, in this example) to the intermediate language (Chinese), followed by a backward translation to the source language. Secondly, the semantics regarding the quantifiers and logical relations are identified in the original and translated sentences and transformed into the corresponding regular expressions. The purpose of sentence abstraction is to capture the quantifiers and logical relations as precisely as possible; the details of sentence abstraction and regular expression transformation are explained in §3.3.1 and §3.3.2. After the second step, the semantics of the original and translated sentences are captured by regular expressions. Taking Illustrative Example 1 as an example, the meanings of the two English sentences can be described by the corresponding regular expressions [A-Z]{3,} and [A-Z]{0,3}, where [A-Z] represents an uppercase letter, and the quantifiers {3,} and {0,3} capture "3 or more" and "no more than 3", respectively.
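As a quick, standalone illustration (not part of SemMT, and assuming standard PCRE-style syntax rather than the dataset's own regex DSL), the two regexes above indeed encode opposite quantifier semantics:

```python
# Illustration only: strings satisfying "3 or more uppercase letters"
# versus "no more than 3 uppercase letters".
import re

three_or_more = re.compile(r"[A-Z]{3,}")
no_more_than_three = re.compile(r"[A-Z]{0,3}")

for s in ["AB", "ABC", "ABCD"]:
    print(s,
          bool(three_or_more.fullmatch(s)),       # AB: False, ABC: True, ABCD: True
          bool(no_more_than_three.fullmatch(s)))  # AB: True,  ABC: True, ABCD: False
```

Only strings of exactly three uppercase letters satisfy both descriptions, which is why the semantic overlap of the two sentences is small.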
Next, semantic similarity is calculated over the two regular expressions by regex-based semantic similarity metrics. In particular, SemMT utilized three semantic similarity metrics to differentiate the semantic meanings. If the similarity is higher than the predefined threshold, the semantic meaning of the original sentence is well-preserved after translation; otherwise, SemMT reports it as a candidate round-trip mistranslation. As shown in the illustrative examples, take DFA-based similarity (details can be found in § 3.4.2) as example, the semantic similarities of two examples are relatively low (0.000) for both translations, SemMT, therefore, reports the original, intermediate and and translated sentences as a potential round-trip mistranslation. In the following of this section, we will explain the details of the four steps using two illustrative examples, showing how SemMT is able to capture semantic difference and detect mistranslations in round-trip translations. Challenges arise in the measurement of similarity between regexes and the determination of thresholds. We will discuss them in §3.3, §3.4 and §3.5. Round trip translation (RTT) is also known as back-and-forth translation. It translates a given text or sentence into an intermediate language (the forward translation), and then translates the result back into the source language (the back translation) [80] . It reflects the general quality of a translation system over longer texts [80] and is a cost-effective choice [60, 66, 99] to derive reference translation automatically. The benefit of adopting RTT in our methodology is that the semantics of the original and back-translated sentences can be uniformly measured and compared in the same language. The selection of source and intermediate languages are not restricted by our methodology. Yet, the selection of language pairs will affect the effectiveness of methodology in two aspects. First, in some languages (e.g., Chinese and Japanese), nouns are the same in both singular and plural forms, while they are in different forms in other languages like English and Russian. The RTT between these two kinds of languages may lose/switch the singularity or plurality information, making the semantics changed. Second, the availability of automated transformation approaches used in the following steps (i.e., the second step in Fig. 3 ) also needs to be considered. If the automatic approaches of deriving regex from natural language sentences in the source language is unavailable, the workflow is unlikely to proceed automatically. Hence the selection of language pairs under test should take these two aspects into account. Besides, since SemMT adopts a black-box testing manner, so the translators under test can be either open-or close-sourced. Nevertheless, one may concern that RTT involves testing two systems instead of one [66, 99] and it is unclear which trip is buggy when a sentence is mistranslated. It subsequently motivates us to explore the possibility to locate the buggy trip automatically, and we discuss one possible solution in Section 5.4. SemMT is empowered by the recent advances made on the synthesis of regular expressions from natural languages using rule-based [48, 72] and learning-based [53, 64, 96] approaches. In this subsection, we present the key ideas of how these techniques can be applied in the methodology of SemMT. In the following, we adapt two illustrative examples in Fig. 
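The RTT-based workflow can be sketched as follows. This is our own minimal illustration, not the SemMT implementation: the helpers `translate`, `nl_to_regex` and `regex_similarity` are hypothetical placeholders (any MT client, NL-to-regex model, and one of the SemMT metrics could stand in for them) and are not the APIs of a real library.

```python
def round_trip(sentence, translate, src="en", pivot="zh"):
    """Forward-translate to the pivot language, then back to the source."""
    intermediate = translate(sentence, source=src, target=pivot)
    back = translate(intermediate, source=pivot, target=src)
    return intermediate, back

def check_translation(sentence, translate, nl_to_regex, regex_similarity,
                      threshold=0.5):
    """Report a suspicious round-trip mistranslation when the regex-level
    semantic similarity falls below the threshold."""
    intermediate, back = round_trip(sentence, translate)
    sim = regex_similarity(nl_to_regex(sentence), nl_to_regex(back))
    if sim < threshold:
        return {"original": sentence, "intermediate": intermediate,
                "back_translated": back, "similarity": sim}
    return None
```

Because the comparison happens between two sentences in the same (source) language, the same NL-to-regex transformation and similarity metric can be applied to both sides.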
SemMT is empowered by recent advances in the synthesis of regular expressions from natural language using rule-based [48, 72] and learning-based [53, 64, 96] approaches. In this subsection, we present the key ideas of how these techniques are applied in the methodology of SemMT. In the following, we adapt the two illustrative examples in Fig. 3 as running examples to show how the resulting transformed regexes and deterministic finite automata are derived, together with an introduction of the related formalism.

• Running Example 1 (Original Sentences)
 - S1: A string that contains only 3 or more uppercase letters.
 - S2: A string containing only no more than 3 uppercase letters.

3.3.1 Sentence Abstraction. Sentence abstraction is a preprocessing step conducted on the original and translated sentences before converting them into regular expressions. It helps to focus on the semantics that relate to quantifiers and logical relations. In general, sentence abstraction works in two steps: (1) identify the non-terminal words/tokens that describe quantifiers and logical relations in the sentence, and (2) abstract the terminals or non-terminals that are grouped by the identified non-terminals as abstracted objects, denoted by symbols such as X and Y. Specifically, the non-terminals and terminals we used are those specified and used in existing works [53, 64], and are associated with verbalizations for both regular expressions and language descriptions. For the first running example, the terminals (e.g., "uppercase letters") and non-terminals (e.g., "contains only", "3 or more", "no more than 3") are specified in [53], and therefore no abstraction is needed. For the second running example, the terminal nouns/phrases grouped by the non-terminals (i.e., "not containing only" and "containing only") are not specified, so they are abstracted and denoted by the abstract symbols X and Y. Hence, after abstraction, the sentences in the two running examples are as follows:

• Running Example 1 (Abstracted Sentences)
 - S1': A string that contains only 3 or more uppercase letters.
 - S2': A string containing only no more than 3 uppercase letters.
• Running Example 2 (Abstracted Sentences)
 - S3': X not containing only Y.
 - S4': X containing only Y.

Note that after abstraction, the meaning of the sentences in the first example is almost fully preserved, while in the second example most of the semantics are abstracted away, leaving only the logic-related meaning.

3.3.2 Regular Expression Transformation. Regular expressions are widely used in practice. Let Σ be a finite alphabet of symbols. A word is a finite sequence of symbols chosen from this alphabet, and the set of all words over Σ is denoted by Σ*. For example, 01101 is a word from the binary alphabet Σ = {0, 1}. The empty word and the empty set are denoted by ε and ∅, respectively. Regexes over Σ are defined inductively as follows: ε, ∅, a ∈ Σ, and [C] where C ⊆ Σ are regular expressions; for regular expressions r1 and r2, the disjunction r1|r2, the concatenation r1r2, and the repetition r1{n1,n2} are regular expressions. r1{n,∞} is often simplified as r1{n,}. For the running examples, the sentences are tokenized and fed into the state-of-the-art model [64], resulting in the following regexes:

• Running Example 1 (Transformed Regexes)
 - R1: [A-Z]{3,}
 - R2: [A-Z]{0,3}
• Running Example 2 (Transformed Regexes): R3 and R4, derived from S3' and S4', respectively.

Here [A-Z] denotes an uppercase letter; the quantifier {0,3} indicates that the object before it (i.e., [A-Z] in the example) may appear zero to three times, while the quantifier {3,} indicates that the object may appear three or more times. One may notice that after transformation, the abstract symbol X in the second running example does not appear in the regex R4. This is because for the non-terminals that describe part-whole relations [58, 59, 76], such as inclusion (e.g., "contain") or exclusion (e.g., "exclude"), the preceding terminal is regarded as the collection while the latter one represents the part/component inside.
For example, the sentence "X containing only Y" (S4') is transformed to [Y], where X is the collection which includes only the abstract symbol Y. Similarly, the terminal "a string" in the first example is also regarded as the collection, which is absent in R1 and R2.

Language of a regular expression. The term language denotes a set of strings chosen from the set of words Σ*. Let L(r) be the language of a regular expression r. L(r) can be defined inductively over the structure of r, e.g., L(∅) = ∅, L(ε) = {ε}, L(a) = {a} for a ∈ Σ, L(r1|r2) = L(r1) ∪ L(r2), and L(r1 r2) = L(r1)L(r2).

Deterministic Finite Automaton (DFA). A Deterministic Finite Automaton (DFA) is a finite-state machine that accepts or rejects a given string of symbols by running through a state sequence uniquely determined by the string [41]. It has the same expressive power as the corresponding regular expression, which means any regular expression can be converted into a DFA that recognizes the language it describes, and vice versa. A DFA [41] can be formally defined as a 5-tuple (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is an alphabet, δ: Q × Σ → Q is the transition function, q0 ∈ Q is the start state, and F ⊆ Q is the set of accept states. In Fig. 4, we visualize the DFAs of the two regular expressions (R1 and R2) for better illustration. In addition, according to §3.3.3, each regular language can be expressed by a regular expression. There exists a unique minimal automaton that accepts a given regular language with a minimum number of states. This minimal automaton is known as a minimal DFA. A minimal DFA is guaranteed to have a regular expression which is semantically equivalent to it [96]. For example, the DFAs of the regular expressions R1 and R2 in §3.3.2 are shown in Fig. 4.

In the two running examples above, the meanings of quantifiers such as "3 or more" and "no more than 3" can be precisely captured by the regex quantifiers {3,} and {0,3}, respectively. However, as mentioned in §2, some quantifiers and logical relations such as "a few" and "many" can only be approximately captured by regexes or DFAs. Hence, we give an example to illustrate how SemMT can be adapted to handle this situation. Take the following sentence as an example:

• S5: The U.S. contains a few states which choose to have the judges in the state courts serve for life terms.

The vague quantifier "a few" is used with plural countable nouns and emphasizes a small number of objects. To transform this sentence into a regex, we first process it via sentence abstraction. The resulting abstracted sentence is "X contains a few Y", where the abstract symbol [Y] represents the "states which choose to have the judges on the state's courts serve for life terms", and X, abstracted from the words "The U.S.", represents the collection which includes only the abstract symbol Y. Then, to approximate the semantics of S5 by a regex, either over- or under-approximation can be applied. Specifically, over-approximation enlarges the number of countable objects compared with the original number, while under-approximation underestimates the amount. For the example S5, two possible approximations are an over-approximation in which Y appears three or more times and an under-approximation in which Y appears one to three times; these can be transformed into the regexes [Y]{3,} and [Y]{1,3}, respectively, where the quantifiers {3,} and {1,3} prescribe that the symbol Y may appear three or more times and one to three times, respectively. By doing so, the semantics of sentences like S5 can be approximately captured by regexes, the semantic similarities can then be calculated on top of them, and mistranslations can be detected accordingly.
However, the approximated semantics may affect the effectiveness of mistranslation detection to some extent, so we also discuss this influence in §5.3. Note that the conversion of sentences to regexes is a major research problem in natural language processing. We do not make contributions to such conversion in SemMT but leverage existing conversion techniques. Since the effectiveness of mistranslation detection can be affected by the quality of the approximated regexes, we evaluate SemMT on sentences with quantifiers or logical relations that can be precisely quantified, owing to the availability of automated tools for transforming such relations. The finding based on the five large-scale natural language corpora in §2 indicates that a large majority (89.6%) of quantifiers and logical relations can be precisely quantified.

The semantic differences of sentences involving quantifiers and logical relations cannot be adequately captured by SBERT [73], the state-of-the-art metric proposed to measure semantic similarities. As shown in the second example of Fig. 1, SBERT gives nearly 100% semantic similarity between "at least" and "at most". Such weakness is commonly found when SBERT is applied to the NL-RX-Synth [53] dataset, which is used by the natural language processing community, for sentences that contain quantifiers and logical relations. We will discuss it in more detail in our experiment. This motivates us to develop new metrics based on regexes and DFAs to better measure the semantic similarities of sentences involving quantifiers and logical relations.

Regex-based Similarity (SemMT-R). Considering the transformation rules of regular expressions, the similarity between two regexes can be calculated by their Levenshtein distance. Specifically:

Sim_R(r1, r2) = 1 − LEV(r1, r2) / max(|r1|, |r2|),

where LEV(r1, r2) is a function that computes the Levenshtein distance between the regular expressions r1 and r2, and |r| denotes the length of r in tokens. Note that in the Levenshtein distance calculation, we follow the convention of counting terminals (e.g., [a-z], which denotes an arbitrary lower-case letter, and [0-9], which represents an arbitrary digit from 0 to 9) specified in [53] as distance one. Hence, for the running examples, the regex-based similarities are:

• Running Example 1 (SemMT-R Similarity): Sim_R(R1, R2) = 1 − 3/6 = 0.500
• Running Example 2 (SemMT-R Similarity)
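As a concrete sketch of the SemMT-R idea (our own illustration, not the SemMT implementation), the following computes a normalized Levenshtein similarity over regex tokens. The tokenizer is an assumption: it only treats a whole character class such as [A-Z] as a single token, so the value it yields for Running Example 1 differs slightly from the 0.500 obtained with the paper's own tokenization.

```python
import re

def tokenize(regex):
    # Treat a character class like [A-Z] as one token; every other character
    # (braces, digits, commas, operators) is its own token.
    return re.findall(r"\[[^\]]*\]|.", regex)

def levenshtein(a, b):
    # Classic dynamic-programming edit distance over token sequences.
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ta != tb)))   # substitution
        prev = cur
    return prev[-1]

def semmt_r(r1, r2):
    t1, t2 = tokenize(r1), tokenize(r2)
    return 1 - levenshtein(t1, t2) / max(len(t1), len(t2))

print(semmt_r("[A-Z]{3,}", "[A-Z]{0,3}"))  # ≈ 0.667 under this tokenization
```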
Though efficient, this regex-based similarity may fail to capture the semantic difference in some situations [15, 24]. For example, R3 and R4 have semantically opposite meanings, while the Levenshtein distance between them is only 1. Hence, we further explore the measurement of semantic similarity using DFAs, which can capture the languages described by the regexes.

DFA-based Similarity (SemMT-D). An alternative way to measure the similarity between two regexes is to evaluate the Jaccard similarity between their regular languages. This approach can be effective in finding the similarity between two finite sets of data, drawn from various application domains [6, 24, 32]. Specifically, given two regexes, one can construct two corresponding semantically equivalent minimal DFAs, and then calculate the Jaccard similarity using the following equation [24, 41]:

Sim_D(r1, r2) = |L(r1) ∩ L(r2)| / |L(r1) ∪ L(r2)|,

where L(r1) and L(r2) are the languages of the regexes r1 and r2, respectively. However, an issue with such a similarity is that regular languages can be infinite. To address this issue, we adapt ideas from existing works [11, 12, 22, 75] to improve the efficiency. For a language L, let L≤n denote the set of words in L of length at most n, and |L≤n| the number of such words. We first define the function Sim'_D(r1, r2, n), where n ∈ N, as follows:

Sim'_D(r1, r2, n) = |(L(r1) ∩ L(r2))≤n| / |(L(r1) ∪ L(r2))≤n|.

Then, we reduce the calculation of Sim_D(r1, r2) to computing the limit of the function Sim'_D(r1, r2, n):

Sim_D(r1, r2) = lim (n→∞) Sim'_D(r1, r2, n).

Finally, the calculation proceeds iteratively, increasing n until the result converges, i.e., until the change between successive iterations is smaller than a preset threshold. Note that this threshold is customized to balance efficiency and effectiveness; in our evaluation, we set it to 0.001. For the running examples, the DFA-based similarities between the two pairs of regexes are both 0.000, with the iteration converging at n = 6 and n = 1, respectively, given that the limit of the function Sim'_D(r1, r2, n) approximates the DFA-based similarity over the infinite regular languages.

Hybrid Similarity (SemMT-H). The SemMT-R and SemMT-D metrics described above have their own advantages in the measurement of semantic similarity. The regex-based similarity measures the semantic similarity between two regexes at the textual level, while the SemMT-D similarity measures it from the perspective of the language set. Hence, we propose a hybrid metric that enjoys both advantages by combining SemMT-R and SemMT-D with customized weights. The hybrid similarity is calculated by the following equation:

Sim_H(r1, r2) = K · Sim_R(r1, r2) + (1 − K) · Sim_D(r1, r2),

where K is a customized parameter which adjusts the balance between the regex-based and DFA-based similarity metrics. For example, if K is set to 0.5, the hybrid similarity for Running Example 1 is Sim_H(R1, R2) = 0.5 × 0.500 + 0.5 × 0.000 = 0.250. Note that the selection of different values for the parameter K can influence the effectiveness of the SemMT-H similarity: the larger K is, the more the SemMT-H similarity depends on the SemMT-R similarity, and vice versa. To further demonstrate the influence of K on the effectiveness of the SemMT-H similarity, we conduct an experiment to discuss this issue in §5.5.
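To make the bounded computation and the hybrid combination above concrete, the following is a minimal, self-contained Python sketch. It is not the SemMT implementation: the weighted-DFA encoding (one symbol class "U" standing for the 26 uppercase letters), the use of inclusion-exclusion for the union count, and all names are our own illustrative assumptions; it also assumes both DFAs are defined over the same symbol classes.

```python
class DFA:
    def __init__(self, start, accepting, transitions, class_sizes):
        self.start, self.accepting = start, set(accepting)
        self.trans, self.sizes = transitions, class_sizes  # {state: {class: next}}

    def count_words(self, n):
        """Number of accepted words of length at most n (dynamic programming)."""
        dist = {self.start: 1}
        total = 1 if self.start in self.accepting else 0
        for _ in range(n):
            nxt = {}
            for state, ways in dist.items():
                for label, to in self.trans.get(state, {}).items():
                    nxt[to] = nxt.get(to, 0) + ways * self.sizes[label]
            dist = nxt
            total += sum(w for s, w in dist.items() if s in self.accepting)
        return total

def intersection(a, b):
    """Product DFA for L(a) ∩ L(b); missing transitions act as a dead state."""
    start = (a.start, b.start)
    trans, seen, todo = {}, {start}, [start]
    while todo:
        sa, sb = todo.pop()
        row = {}
        for label in a.sizes:
            ta = a.trans.get(sa, {}).get(label)
            tb = b.trans.get(sb, {}).get(label)
            if ta is not None and tb is not None:
                row[label] = (ta, tb)
                if (ta, tb) not in seen:
                    seen.add((ta, tb))
                    todo.append((ta, tb))
        trans[(sa, sb)] = row
    accepting = {s for s in seen if s[0] in a.accepting and s[1] in b.accepting}
    return DFA(start, accepting, trans, a.sizes)

def semmt_d(a, b, eps=0.001):
    """Iterate Sim'_D(a, b, n) until successive values differ by less than eps."""
    inter, prev, n = intersection(a, b), None, 1
    while True:
        i = inter.count_words(n)
        union = a.count_words(n) + b.count_words(n) - i   # inclusion-exclusion
        sim = i / union if union else 1.0
        if prev is not None and abs(sim - prev) < eps:
            return sim
        prev, n = sim, n + 1

def semmt_h(sim_r, sim_d, k=0.5):
    """Hybrid similarity: weighted combination of SemMT-R and SemMT-D."""
    return k * sim_r + (1 - k) * sim_d

# DFAs for R1 = [A-Z]{3,} and R2 = [A-Z]{0,3} (our own encodings).
sizes = {"U": 26}
r1 = DFA(0, {3}, {0: {"U": 1}, 1: {"U": 2}, 2: {"U": 3}, 3: {"U": 3}}, sizes)
r2 = DFA(0, {0, 1, 2, 3}, {0: {"U": 1}, 1: {"U": 2}, 2: {"U": 3}}, sizes)
print(round(semmt_d(r1, r2), 3))      # tends to 0.0, as reported above
print(semmt_h(0.5, semmt_d(r1, r2)))  # 0.25 for Running Example 1 with K = 0.5
```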
Using one of the above regex-based metrics, the semantic similarity between the original and back-translated sentences can be calculated. We then decide whether a given translated sentence is far enough from the original to indicate the presence of a mistranslation. To do so, following an existing study, TransRepair [83], we first conduct threshold selection through fine-grained enumeration guided by the target evaluation metric (i.e., F-Score), and then detect mistranslations based on the similarity threshold which achieves the optimum performance. Finally, we identify those sentences whose semantic similarity to the original sentence is lower than the threshold as suspicious mistranslated sentences. The intuition of guiding the selection by F-Score is that it better balances precision and recall. Besides, the threshold can be customized, since the user may prioritize minimizing false positives or maximizing recall depending on their goals. In §4, we show the trade-offs for different threshold values. If the similarity between the original and back-translated sentences is less than the threshold, SemMT will report it as a suspicious mistranslation. Considering the fact that a forward mistranslation can hardly lead to a reasonable backward translation [80], some may believe it is enough to report only the forward-trip translation (i.e., the pair of original and intermediate sentences). However, according to our analysis and investigation, 26% of the mistranslations detected by SemMT are introduced in the backward trip (§5.4). Therefore, when the similarity falls below the threshold, SemMT reports the original, intermediate and back-translated sentences together.

This section reports the effectiveness of SemMT by studying the following four research questions (RQs):

• RQ1: How effective is SemMT in finding buggy translations? We evaluated the effectiveness of SemMT in terms of accuracy, F-Score, precision and recall compared with other semantic and non-semantic similarity metrics. We also quantified their capabilities in distinguishing buggy and correct translations.
• RQ2: Can SemMT outperform the state-of-the-art works? We evaluated the performance using the number of issues detected, precision, recall and F-Score for each work.
• RQ3: Can SemMT's effectiveness be improved by combining different similarity metrics? Different metrics tend to evaluate similarity from diverse aspects, so we explored combinations of similarities to see whether their effectiveness can be mutually improved.
• RQ4: What is the applicability of SemMT? The applicability of a testing framework is also critical to its practical usefulness. We therefore further investigated the applicability of our framework, i.e., whether it can achieve similar effectiveness on different translators, language pairs and datasets.

We implemented SemMT in Python, and conducted experiments on a machine powered by one Intel i7-8700K@3.7GHz CPU that supports 6 cores, an Nvidia GeForce Titan V with 12GB VRAM, 64GB memory and dual Tesla M60s with 8GB VRAM. The parameter K for SemMT-H is set to 0.5 to strike a balance between SemMT-R and SemMT-D. All experiment results have been released for validation [2].

Dataset. Since we mainly focus on the semantics of quantifiers and logical relations, the selected dataset should consist of sentences conveying such semantics, which is non-trivial for the evaluation. In this study, we adopted two benchmarks, NL-RX-Synth [53] and KB13 [53, 64], as the sources of test inputs for SemMT. The two benchmarks are frequently used [17, 53, 64] for the task of generating regular expressions from natural language descriptions. In these datasets, the words/tokens can be represented as the semantically equivalent alphabet of symbols, quantities, logical relations or quantitative modifiers. For example, the word "numbers" can be represented by the set of digits ([0-9]), and the semantic relation "A or B" can be represented as the disjunction between them (A | B). In total, NL-RX-Synth comprises 10,000 pairs of English sentences and the corresponding regular expressions. The average character count of sentences is 66.36 in the NL-RX-Synth dataset and 40.61 in KB13. For the first three RQs, the test inputs were randomly selected from NL-RX-Synth, while for RQ4, the test inputs were randomly selected from KB13.

Labeling. The output of SemMT is a list of suspicious issues, each of which consists of the original sentence, the intermediate translation and the target translation. Two of the authors inspected all the results separately and discussed all inconsistent labels until convergence. Besides labeling whether the RTT is correct or not, we also labeled whether the forward and backward translations are correct. In addition, to compare with the existing state-of-the-art approaches, we also labeled the issues reported by each of them. The output of the existing approaches is a list of suspicious issues including the original, intermediate and back-translated sentences.

Model Training.
We adopted the state-of-the-art regex synthesis work [64] to transform natural language to regexes, which uses reinforcement learning to train a sequence-to-sequence model. In addition, to make it fit our work better, we performed grammar checking and data augmentation on the original NL-RX-Synth dataset in order to improve accuracy and enlarge the vocabulary size. Specifically, to augment the data, we selected one fourth of the sentences and replaced words in them with synonyms. Note that a manual check of the replaced synonyms is necessary because we need to ensure the semantic meaning is maintained after the replacement. We also confirmed with a linguist whether the semantics are preserved after the synonym substitution. After augmentation, 13,588 sentences were obtained with an average vocabulary size of 123. We then split the data into training, validation and testing sets by 80%, 5% and 15%, respectively. The model was trained for 30 epochs in total, achieving a 90.93% accuracy on the test set. To answer RQ4, we also performed training on the KB13 dataset using a process similar to that on the NL-RX-Synth dataset. The resulting model achieves an accuracy of 77.67% on the test set.

Similarity Metrics. Various similarities have been used by recent works to detect translation errors [36, 39, 83] and estimate translation quality [18, 51, 56, 57, 60, 63, 69, 88]. We selected three syntactic-based similarities recently used by related works [39, 83] and one state-of-the-art semantic-based similarity [73] as baselines, and compared the accuracy, precision and recall of their bug detection with those of the three similarities supported by SemMT.

• Levenshtein-based similarity (LEVEN). It quantifies how dissimilar two strings are by calculating the Levenshtein distance (a.k.a. edit distance), i.e., counting the minimum number of operations required to transform one string into the other [74], and normalizing it in the same way as [35, 83, 94].
• Dependency-based similarity (DEP). Dependency relations describe the direct relationships [16] between words. We evaluated the distance between two sets of dependency relations by summing up the absolute differences in the numbers of each type of dependency relation, as described in a recent work [39].
• BLEU-based similarity (BLEU). BLEU (BiLingual Evaluation Understudy) [63] aims to automatically evaluate machine translation quality by checking the correspondence between the output of machines and that of humans [23, 63]. For one target sentence, the BLEU score is calculated by comparing it to a set of good-quality reference translations. The details of BLEU can be found in [63].
• SBERT similarity (SBERT). We also used the state-of-the-art sentence-level semantic approach, SBERT [73], a refined version of BERT [28], as the baseline semantic similarity metric in our evaluation, because it performs the best in RTT quality estimation, as [60] suggests.
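SBERT similarity is typically computed as the cosine similarity of sentence embeddings. The sketch below uses the sentence-transformers library for illustration; the specific model name is our assumption, as the paper does not state which SBERT model was used as the baseline.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption
s1 = "A string that contains only 3 or more uppercase letters."
s2 = "A string containing only no more than 3 uppercase letters."
e1, e2 = model.encode([s1, s2])
similarity = float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
print(similarity)  # a high value here would echo the SBERT weakness discussed in §3.4
```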
Comparisons. In the experiment, we compared SemMT with SIT [39], TransRepair [83] and PatInv [36]. These works utilize different methods to generate mutants of the sentences under test, and then detect inconsistencies according to different metrics. In particular, SIT [39] generated syntactically equivalent mutants using BERT by replacing nouns and adjectives in sentences. Similarly, TransRepair [83] generated mutants by replacing nouns, adjectives and numbers with their synonyms, and further constructed word-mutant pairs. For comparison, we implemented these works to generate mutants either using their released source code [39] or by carefully following the explanations in their papers [36, 83]. We adjusted the parameters of these works on our dataset following the original strategies published in their papers. TransRepair [83] used 0.9 as the minimum cosine similarity of word embeddings to generate word pairs; in our experiment, we lowered the threshold to 0.8 in order to generate a sufficient number of pairs. For SemMT, since it is not originally designed for mutant generation, we adapted mutant generation processes similar to those of the baselines by replacing nouns, numbers and relational adverbs with their synonyms, and removed semantically-changed mutants by manual check. We manually validated the generated mutants to ensure the semantics are preserved after synonym substitution, and the manual checking process was conducted under the guidance of a linguist to reduce the threat that it may introduce to the reliability of the experimental results. For each sentence under test, SemMT generated up to two mutants. Note that since SemMT adopts the RTT paradigm, the reported issues may be caused by either the forward or the backward trip. Therefore, we labeled both translation trips, and for fairness, the comparison focused only on the correctness of the forward translation while removing those whose backward translations are incorrect.

Evaluation metrics. To evaluate the effectiveness, we adopted accuracy, precision, recall and F1-Score (abbrev. F-Score). Given the number of true positives (TPs, a TP refers to a mistranslated sentence that is reported to be a mistranslation), false positives (FPs, a FP refers to a correctly translated sentence that is reported to be a mistranslation), true negatives (TNs, a TN refers to a correctly translated sentence that is not reported to be a mistranslation) and false negatives (FNs, a FN refers to a mistranslated sentence that is not reported to be a mistranslation), the metrics are defined as follows:

• Accuracy: the proportion of correctly reported (whether the translation is correct or not) sentences, i.e., Accuracy = (TP + TN) / (TP + TN + FP + FN).
• Precision: the proportion of real mistranslations over the reported mistranslations, i.e., Precision = TP / (TP + FP).
• Recall: the ratio of reported mistranslations over all the real mistranslations, i.e., Recall = TP / (TP + FN).
• F-Score: twice the product of precision and recall divided by their sum, i.e., F-Score = 2 · Precision · Recall / (Precision + Recall).
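For reference, a small sketch (our own illustration, not SemMT's code) of the metrics defined above and of the threshold selection described in Section 3 is shown below: thresholds are enumerated at a fine granularity and the one with the best F-Score on labeled data is kept.

```python
def metrics(predicted, actual):
    """Accuracy, precision, recall and F-Score from boolean verdicts and labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    accuracy = (tp + tn) / len(actual) if actual else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return accuracy, precision, recall, f_score

def select_threshold(similarities, is_buggy, step=0.01):
    """Enumerate thresholds in [0, 1] and keep the one with the best F-Score."""
    best, t = (None, -1.0), 0.0
    while t <= 1.0:
        # A sentence is flagged as a mistranslation when its similarity
        # falls below the current threshold.
        predicted = [s < t for s in similarities]
        f = metrics(predicted, is_buggy)[3]
        if f > best[1]:
            best = (t, f)
        t = round(t + step, 2)
    return best  # (threshold with the highest F-Score, that F-Score)
```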
Moreover, SemMT-D Precision Recall F-Score Since the effectiveness of these metrics are affected by customized threshold, we also illustrated how the performance of these evaluation metrics vary against different thresholds. Specifically, we normalized all similarity values, and set the threshold from 0.0 to 1.0, with step 0.01. In Fig 5, we presented the trends on accuracy, precision, recall and F-Score. Overall, all three of our similarity metrics (drawn in red) outperform the others in terms of accuracy, precision, recall and F-Score for most of the threshold values. Apart from our three similarities, most times LEVEN outperforms other similarities. Although SBERT measures semantic similarity, its performance is not as good as expected. Summary of Findings Related to RQ1: The three similarity metrics used in SemMT are effective for mistranslation detection. They outperform the other metrics against almost all threshold values. Specifically, in terms of precision, recall and F-Score, our metrics achieve an increase of 13%, 30% and 23% compared with the highest value achieved by other metrics. In this section, we compared SemMT with SIT [39] , TransRepair [83] and PatInv [36] on mutant generation and the effectiveness of bug detection in terms of accuracy, precision, recall and F-Score under the En-Zh language setting on the Google translator. We randomly selected 200 sentences from the dataset and used them as sentences to be tested. We first generated and filtered out mutants in the way as described in the original paper of baselines [36, 39, 83] . The numbers of generated mutants are listed in Table 3 . These mutants are generated using their own generation approaches, and after filtering, there are 223 to 452 mutants left for each work. Note that the number of filtered mutants does not necessarily indicate a better capability on bug detection, it is depended on different strategy of mutant generation, the effectiveness is continued to be evaluated. Specifically, we evaluated the effectiveness of each method in terms of the four evaluation metrics. The results are presented in Table 3 achieves the optimum performance with respect to F-Score is chosen for each work, as listed in Table 3 . Note that the threshold of SIT is the distance between the dependency parse trees, while PatInv is not tuned by the threshold. We choose two most well-performing (with the highest F-Score at 0.82 and 0.82 with thresholds 0.963 and 0.906, respectively) metric values out of the four in TransRepair, which are the Levenshtein-(denoted as "ED" in [83] ) and BLEU-based method, written as TransRepair(L) and TransRepair(B). For TransRepair(L) and (B), a buggy translation issue is reported when the metric value is smaller than or equal to the selected thresholds, For SIT, a buggy translation issue is reported when the metric value is larger than or equal to the selected threshold. According to Table 3 , SemMT achieves the highest accuracy and F-Score compared with existing works, with a similar number of issues detected. In particular, the highest accuracy achieved by SemMT (74.1% by SemMT-H) is 34.2% higher than the highest accuracy achieved by TransRepair(B) (55.2%), and the highest F-Score (56.3%) achieved by SemMT (SemMT-R) is 15.4% larger than the highest value achieved by SIT (48.8%). In addition, although SIT achieves the best recall (93.2%), its precision is relatively low (33.1%). The highest precision is achieved by PatInv (56.5%), yet the number of issues identified is small (16) . 
In contrast, SemMT (SemMT-D) achieves a comparatively high recall (85.5%) with 150 (166 − 16 = 150) more issues reported. Moreover, we plot in Fig. 6 the correlation between precision and the number of mistranslations detected as the threshold varies. A translation is regarded as a candidate mistranslation if its similarity is smaller than (for thresholds on distances, larger than) or equal to the given threshold. The X and Y axes represent the number of issues detected and the precision, respectively. The threshold setup for each method proceeds as follows: for SemMT and TransRepair, we normalized the similarity values to the range [0, 1] and set the threshold from 0.0 to 1.0 with a step of 0.1; the distance threshold for SIT ranges from 1 to 17; no threshold is required for PatInv. Therefore, there are 11 dots for each of SemMT-R, SemMT-D, SemMT-H, TransRepair(B) and TransRepair(L), 17 dots for SIT and 1 dot for PatInv. In the figure, we did not label the threshold of every dot, but the threshold values can be inferred: the closer the dot is to the y-axis, the larger the similarity threshold (the smaller the distance threshold). According to Fig. 6, we can see that there is a trade-off between the precision and the number of mistranslation issues reported, i.e., as the similarity threshold increases, more translations are regarded as mistranslations (i.e., more issues are reported), but more false positives may be involved, resulting in a lower precision. To better illustrate this balance, we set precision and the number of mistranslations reported as the axes; the closer a dot is to the top-right corner, the better the result strikes the balance.

To answer RQ3, we analyzed whether the mistranslations reported by different metrics overlap. We also explored whether the combination of metrics can improve the performance with respect to accuracy, F-Score and the number of issues detected. As previously mentioned, a metric's performance varies with threshold values. Therefore, for a fair comparison, we chose for each metric the threshold value that maximizes its performance based on the largest product of true positives and false negatives. Fig. 7 compares the number of bugs uniquely and commonly detected by the three semantic-based metrics (i.e., REG, DFA, HYB) and the other metrics. As shown in the figure, DFA detects the most mistranslations (213) compared with the other metrics, 40 (i.e., 213 − 173) more than the second most. Besides, though the total numbers of mistranslations reported by REG and HYB are not the highest, the numbers of unique mistranslations (i.e., mistranslations that can only be detected by one metric) are high, at 19 and 18, respectively. Such results also reveal that the metrics mildly complement each other, since they can detect different mistranslations, indicating that their combination may lead to improvement. Motivated by this, we studied whether the performance of SemMT can be boosted by combining it with other existing techniques. Specifically, we adopted a simple strategy which regards a translation as buggy if either of the two combined metrics reports it as a mistranslation, as sketched below. The experiment result is illustrated in Fig. 8. The heatmaps show the increase in the number of issues and the improvements in accuracy and F-Score achieved by different combinations. The value in each grid cell (e.g., v[i][j] in the i-th row, j-th column) represents the improvement to the i-th similarity metric obtained by combining it with the j-th metric.
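The combination strategy is simple enough to express in a few lines; the sketch below (our own illustration, with toy verdicts) combines two metrics' per-sentence verdicts with a logical OR and counts the extra issues gained.

```python
def combine_or(flags_a, flags_b):
    """Per-sentence verdicts of two metrics, combined with logical OR."""
    return [a or b for a, b in zip(flags_a, flags_b)]

def extra_issues(flags_a, flags_b):
    """How many more issues metric A reports once combined with metric B."""
    return sum(combine_or(flags_a, flags_b)) - sum(flags_a)

# Toy verdicts for five translations (True = reported as a mistranslation).
leven = [True, False, False, True, False]
dfa = [True, True, False, False, True]
print(extra_issues(leven, dfa))  # 2 extra issues for LEVEN when combined with DFA
```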
The greatest improvement is achieved when LEVEN is combined with DFA, detecting 102 more issues and achieving a 16% improvement in F-Score. In particular, combining any of our three metrics with an existing one can help the existing metric detect 38 to 102 more issues; our metrics, in turn, detect 9 to 79 more mistranslations when combined with an existing metric. Even for DFA, which has already reported 213 issues, a combination with SBERT helps to detect 24 more bugs. The F-Scores of all similarity metrics are mostly improved when a metric is combined with another one. Finally, we combined all 7 metrics and found that 246 bugs can be detected, with a 1% increase in F-Score.

Summary of Findings Related to RQ3: The combination of different metrics can mutually improve their effectiveness to a large degree in terms of the number of mistranslations reported, accuracy and F-Score. In particular, DFA boosts the performance of the other similarity metrics the most, with 102 more mistranslations detected and 16% higher F-Scores achieved. Besides, DFA can, in turn, be boosted by SBERT, with 24 more mistranslations detected.

In the above RQs, we evaluated the effectiveness of SemMT by testing the Google translator under the En-Zh (i.e., English-Chinese) language setting. In the following, we present three experiments to further investigate whether similar effectiveness can be observed on other popular translators, language pairs and test sets, respectively. Our first experiment repeats the experiment in RQ1, replacing the Google translator with the Microsoft Bing translator. Specifically, we randomly selected 100 sentences as test inputs from the NL-RX-Synth dataset, applied the round-trip translation on the Bing translator and collected the translation results.⁹ We repeated the labeling procedure described in §4.1 and calculated the similarities across thresholds from 0.0 to 1.0 at a step of 0.01, making sure the experiment setup is the same as that of RQ1 apart from the translator under test. The experiment result is shown in Fig. 9. We can see that SemMT-D and SemMT-H outperform the other metrics on both accuracy and F-Score for most threshold values. Specifically, SemMT-R achieves the highest accuracy (76%) and F-Score (84%) when the threshold is above 0.9. Among the existing metrics, LEVEN and DEP achieve higher accuracy and F-Score than the others in general, while SBERT is the least effective one for most threshold values. The result also shows that changing the translator under test has little impact on the effectiveness of the three SemMT metrics. Specifically, the different variants of SemMT still outperform the existing baselines significantly, while SemMT-D performs the best over most threshold values.

We also examined whether the choice of language pair would affect the effectiveness of SemMT. Specifically, besides the translation between English and Chinese, we conducted another round-trip translation between English and Japanese using the Google translator on 100 randomly selected sentences from the NL-RX-Synth dataset.¹⁰ The experiment results are illustrated in Fig. 10. The result shows that our three metrics outperform the others for most of the thresholds. Among our three similarity metrics, SemMT-D outperforms SemMT-R and SemMT-H most of the time. Among the other existing metrics, LEVEN outperforms the rest in terms of accuracy and F-Score, while DEP and SBERT reach the lowest accuracy and F-Score, respectively. The result also echoes that in Fig.
5, indicating that the change of language pair has little impact on the effectiveness of our three SemMT metrics, i.e., they outperform the existing baselines significantly while SemMT-D achieves the best overall effectiveness. Furthermore, compared with the result of RQ1 (shown in Fig. 5), SemMT shows similar effectiveness over the thresholds when the language pair is changed, which indicates that the effectiveness of SemMT may hold under changes of language pairs.

4.5.3 Impact of Test Dataset. Finally, we explored whether similar effectiveness can be achieved given test inputs extracted from other datasets. We thus randomly selected 100 sentences from the KB13 dataset to conduct the evaluation. The round-trip translation was conducted on the Google translator between English and Chinese.¹¹ The result is illustrated in Fig. 11. As shown by the red lines, the three metrics of SemMT outperform the other metrics as the threshold changes. Specifically, SemMT-D achieves the highest effectiveness on average, while SBERT and BLEU are less effective among these similarity metrics. In addition, compared with the results displayed in Fig. 5, our three metrics follow similar trends on both datasets, indicating the potential of applying SemMT to various datasets. In contrast, other existing metrics (such as LEVEN, SBERT and BLEU) show noticeably lower effectiveness in terms of F-Score than on the NL-RX-Synth dataset.

⁹ The translation results were collected on March 11, 2021 on the Microsoft Bing translator.
¹⁰ The translation results were collected on March 11, 2021 on the Google translator.
¹¹ The translation results were collected on March 11, 2021 on the Google translator.

Summary of Findings Related to RQ4: The applicability of SemMT has been evaluated by changing the translator (i.e., the Bing translator), the language pair (English-Japanese) and the test set (KB13 [53, 64]). The results obtained are similar to those in RQ1, which indicates that SemMT is applicable to testing under different settings, and the three similarity metrics of SemMT outperform the existing ones for most thresholds.

We made three observations on SemMT's performance from our experiments. First, the experiment results reveal that our proposed similarity metrics (i.e., SemMT-R, SemMT-D and SemMT-H) outperform the baselines over a wide range of threshold values in terms of accuracy, F-Score, etc. Second, compared with other state-of-the-art works, SemMT offers a better balance between precision and the number of mistranslations detected. It also achieves improvements of 34.2% and 15.4% in accuracy and F-Score, respectively, with a similar number of issues detected. Third, we investigated the potential improvement that can be achieved by combining different metrics, and found that SemMT-D can boost the performance of other similarity metrics, with 102 more issues detected and 16% higher F-Scores achieved.

The three SemMT similarities have their own merits. For the best precision (> 0.8), SemMT-R with a small threshold (< 0.1) is a good choice. For the best recall (> 0.8), SemMT-H with a high threshold (> 0.8) and a K value of 0.5 is recommended. For the best F-Score, SemMT-D outperforms the other two metrics over a wide threshold range. One may switch between the three similarity metrics and adjust the threshold value according to the needs of an application.
To follow up on RQ3, since each similarity metric tends to capture certain aspects of mistranslations, we analyzed the types of mistranslations detected by each metric for further investigation. We manually labeled the type of each mistranslation according to the existing work [39], which concluded five types of mistranslations (i.e., Under-Translation, Over-Translation, Word/Phrase Mistranslation, Incorrect Modification and Unclear Logic). The statistics are listed in Table 4. For the first three types, the number of mistranslations detected by each metric is relatively similar, while for "Unclear Logic" and "Incorrect Modification" the number varies a lot. We therefore further subdivided these two mistranslation types to better characterize the mistranslations detected by SemMT. The new subcategories are motivated by the mistranslations detected in sentences that are mainly logic- or quantifier-related.

We subdivide "Unclear Logic" (i.e., all the tokens/phrases are correctly translated but the sentence logic is incorrect [39]) into three subcategories. (1) Incorrect Order. If all tokens and phrases are translated correctly, yet they are organized in a different order after translation, it is an incorrect order mistranslation. For example, as shown in Table 5 (Example 1), in the original sentence the string "dog" is arranged before the string "truck" or a letter, while after translation the letter may be arranged before the string "truck". (2) Incorrect Affiliation. If all tokens and phrases are translated correctly, yet the affiliation relation is incorrect, it is an incorrect affiliation mistranslation. As shown in Table 5 (Example 2), the string contains letters and lowercase letters in the original sentence, while after translation the affiliation is missed. (3) Incorrect Semantics. If tokens/phrases are correctly translated, yet the semantic logic of the original sentence is not preserved after translation, it is an incorrect semantics mistranslation. In the example in Table 5 (Example 3), the original semantic logic describes lines that contain a number 5 or more times, while after translation the meaning is changed to describe the number of lines instead of the lines themselves.

For "Incorrect Modification" (i.e., a modifier modifies the wrong element), we subdivide it into two subcategories according to the type of modifier. (1) Incorrect Qualitative Modification. If a qualitative modifier modifies the wrong element in a sentence, it is an incorrect qualitative modification mistranslation. As illustrated in Table 5 (Example 4), the attribute "string" modifies "dog" in the original sentence, while after translation the modifier of "dog" becomes "numeric character". (2) Incorrect Quantitative Modification. Similarly, if a quantitative modifier modifies the wrong element, it is an incorrect quantitative modification mistranslation. As illustrated in Table 5 (Example 5), the quantitative modifier "at least once" modifies different elements after translation.

False Positives Caused by Singularity and Plurality. When we analyzed the mistranslations reported by SemMT on the dataset used in §4.2, 148 of the mistranslations among the 500 sentence pairs are plural-related, i.e., nouns take the same form in both singular and plural in Chinese, while they take different forms in English. As a result, singularity/plurality is mistakenly dropped or imposed during translation.
Such plural-related mistranslations are commonly seen across languages. For example, languages such as Slovenian, Russian, and Welsh have several plural forms, while languages such as Chinese and Japanese do not have counterparts to the singular and plural forms of languages like English. Though such differences are minor and easy to neglect, with the aid of regexes and DFAs SemMT is able to capture them to some degree; nevertheless, false positives may be introduced for this reason.

False Positives Caused by Inaccurate Transformation From Natural Language to Regular Expression. The inaccurate transformation from natural language sentences to the corresponding regexes may also lead to false positives of SemMT. The main cause is the existence of out-of-vocabulary words in the round-trip translated sentences, which leads to inaccurate transformation from natural language to regex. To alleviate this concern, we enlarged the vocabulary of the training dataset to achieve as high an accuracy as possible. Specifically, we collected the translation results from the Google translator and obtained a list of parsed tokens. Then we augmented the training data by synonym substitution (i.e., replacing tokens in the original training dataset with tokens that are not in the original dataset but are derived from the returned dataset).

Table 5. Examples of the mistranslation subcategories (original sentence, Google translation, and back-translation).
Example 1 (Incorrect Order). Original: lines with the string "dog" before the string "truck" or a letter. Translation: 在字符串"truck"之前的字符串"dog"或字母. Back-translation: the string "dog" or letter before the string "truck".
Example 2 (Incorrect Affiliation). Original: strings with a letter followed by a lower-case letter, zero or more times. Back-translation: string, followed by a letter with a lowercase letter, zero or more times.
Example 3 (Incorrect Semantics). Original: lines with a number, 5 or more times. Translation: 行数大于等于5次. Back-translation: the number of rows is greater than or equal to 5.
Example 4 (Incorrect Qualitative Modification). Original: lines with the string "dog" before a vowel or a numeric character. Back-translation: string before vowel or numeric character "dog".
Example 5 (Incorrect Quantitative Modification). Original: lines starting with a lower case letter at least once or a capital letter. Translation: 以小写字母开头的行或至少一个大写字母的行. Back-translation: lines beginning with lowercase letters or lines with at least one capital letter.

In §3.3.5, we explained how the semantics can be captured over- or under-approximately. However, if we apply the approximated semantics to detect mistranslations over the original and translated sentences, the derived results are likely to be unreliable. Take the example sentence (S5) and its back-translated sentence (T5) from §3.3.5:
• S5: The U.S. contains a few states which choose to have the judges on the state's courts serve for life terms.
• T5: The United States contains several states, which choose to let judges in state courts serve for life.
After sentence abstraction, we obtain the abstracted sentences S5' and T5'. Since "a few" and "several" are vague quantifiers whose semantics are hard to quantify precisely, we approximate their semantics using regexes with open-ended repetition ranges such as {5,}. After calculation, the SemMT-D semantic similarity is as high as 0.957, suggesting that there is little difference after translation. However, people tend to believe that "a few" is less than "several" [40], and such a semantic difference between the two vague quantifiers is hard to capture after approximation. As a result, the wider the range of quantification, the harder it is to capture the semantic difference.
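To make the effect of such range approximation concrete, the following sketch computes a simple overlap-based (Jaccard) similarity between the sets of repetition counts admitted by two quantifier ranges. It is only a simplified stand-in for SemMT-D, which compares the DFAs themselves; the cap of 50 and the printed values are illustrative assumptions, and the tighter ranges {3,25} and {5,25} are the context-aware quantifications discussed next.

    def range_counts(lo, hi):
        # The set of repetition counts admitted by a bounded quantifier {lo,hi}.
        return set(range(lo, hi + 1))

    def count_similarity(a, b):
        # Jaccard similarity between two sets of admissible counts; a rough
        # stand-in for the DFA-based similarity used by SemMT-D.
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    # Open-ended approximations such as {3,} and {5,} have to be capped to be
    # comparable; the cap of 50 (the number of U.S. states) is an assumption.
    CAP = 50
    print(count_similarity(range_counts(3, CAP), range_counts(5, CAP)))  # about 0.96
    print(count_similarity(range_counts(3, 25), range_counts(5, 25)))    # about 0.91

With loose, open-ended ranges the two vague quantifiers are nearly indistinguishable, whereas tighter, context-aware bounds make the difference more visible; this motivates the quantification strategy discussed below.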
Therefore, we discuss one possible solution: quantifying vague quantifiers more precisely by considering common practice and the context of the sentence. Plenty of studies have quantified vague quantifiers [13, 42, 49, 79, 92] from logical, linguistic and psychological aspects. For example, "a few" and "several" are believed to denote fewer than the quantifier "a half". Being aware that the number of states in the U.S. is at most 50, the quantification-related semantics of the above sentences (S5' and T5') can be quantified more precisely, with the quantifiers "a few" and "several" quantified as {3,25} and {5,25}, respectively. By doing so, the SemMT-D similarity between them becomes 0.84, and the subtle semantic difference can be detected.

Despite the advantage of round-trip translation, it has been criticized for testing not one translation system but two [60, 80]. Hence, in this section, we discuss a potential solution to localize the buggy trip (the trip that produces the wrong translation) using the idea of cross reference. The intuition is as follows: if the translation returned by one translator differs from the results of other translators, it is likely to be incorrect; the less similar a trip's output is to the other translators' results, the more likely that trip is erroneous. For better understanding, we analyzed the 500 pairs of round-trip translations used in §4.2 and manually identified the buggy trip for all 265 mistranslations. If both trips are mistranslated, we regard it as a forward mistranslation because that is where the mistranslation was first introduced. Apart from the 148 plural-related mistranslations, there are 87 forward mistranslations and 30 backward mistranslations in total. For cross reference, we used the Microsoft Bing and Youdao translators. We then computed preliminary statistics, calculating the average similarity scores using four similarity metrics (i.e., LEVEN, DEP, BLEU and SBERT) across the two translation trips (i.e., the forward trip translating from English to Chinese, and the backward trip from Chinese to English) for correctly and incorrectly translated sentences. The result is shown in Table 6.

Table 6. Statistics of Buggy Trip Localization. The first three major columns (AveSim_Correct, AveSim_BuggyFW and AveSim_BuggyBW) denote the average similarity scores or distances over correctly/mistranslated sentences across different translators. The last major column shows the number and accuracy of correctly identified buggy trips using different similarity metrics. The values in bold represent the maximum number of correctly located buggy trips.

The values in the first three major columns denote the average similarity scores of the forward translations (columns Sim_FW) in Chinese and the backward sentences (columns Sim_BW), except for DEP, which reports a distance. The higher the similarity score, the more similar the sentences translated by different translators. We can see that the similarity scores are not identical across translation trips, and on average, the similarity scores of correctly translated sentences (column AveSim_Correct) are higher than those of mistranslated ones (columns AveSim_BuggyFW and AveSim_BuggyBW). For example, for LEVEN, the average similarity scores in the forward and backward trips are 0.50 and 0.64, respectively.
Specifically, for LEVEN, the average forward-trip similarity on correctly translated sentences is 0.50, while when the forward translation on Google is mistranslated, the similarity between the sentences translated by the different translators drops to 0.39. Similar patterns can be observed for the other similarity metrics. To sum up, we made three observations: (1) the similarity scores for correct or incorrect translations vary across languages; (2) within the same language, the similarity scores also vary across similarity metrics; and (3) on average, the similarity scores of correctly translated sentences are higher than those of mistranslated ones.

On top of these observations, we tried the following strategy: given an original sentence that has been mistranslated in either trip, if the drop of the forward-trip similarity below the corresponding average score for correct translations (as shown in Table 6) is larger than or equal to the drop of the backward-trip similarity, the mistranslation is considered a forward-trip mistranslation; otherwise it is considered a backward-trip mistranslation (a minimal sketch of this rule is given below). The result is listed in the last three columns of Table 6. The capability of differentiating the mistranslated trip differs across similarity metrics. Specifically, SBERT achieves the highest accuracy (76%) by correctly identifying 74 of the 87 forward mistranslations and 15 of the 30 backward mistranslations, while the syntax-based DEP has the lowest accuracy (38%). In addition, the result indicates that different similarity metrics tend to have diverse capabilities in identifying different translation trips. For example, SBERT identifies the most buggy forward trips (74), while BLEU performs better in identifying buggy backward trips.
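The following sketch expresses the cross-reference rule above, assuming the per-metric average similarities for correct translations (as in Table 6) are available; the function name, structure and the example backward similarity are illustrative assumptions rather than SemMT's implementation.

    def locate_buggy_trip(sim_fw: float, sim_bw: float,
                          avg_correct_fw: float, avg_correct_bw: float) -> str:
        # sim_fw / sim_bw: similarity of this sentence's forward / backward
        # translation against the outputs of the reference translators
        # (e.g., Bing and Youdao).
        # avg_correct_fw / avg_correct_bw: average similarity observed on
        # correctly translated sentences for the same metric (Table 6).
        # The trip whose similarity drops further below its "correct" average
        # is blamed; ties are attributed to the forward trip.
        drop_fw = avg_correct_fw - sim_fw
        drop_bw = avg_correct_bw - sim_bw
        return "forward" if drop_fw >= drop_bw else "backward"

    # Illustrative values loosely based on the LEVEN row discussed above:
    # the forward similarity of a suspected mistranslation (0.39) falls well
    # below the 0.50 average for correct translations, so the forward trip
    # is blamed. The backward similarity of 0.62 is an assumed value.
    print(locate_buggy_trip(sim_fw=0.39, sim_bw=0.62,
                            avg_correct_fw=0.50, avg_correct_bw=0.64))  # forward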
On top of the results achieved by SemMT, we explore a further question: does the setting of K affect the effectiveness of SemMT-H? To answer this question, we set K from 0.1 to 0.9 with a step of 0.2 (i.e., K is set to 0.1, 0.3, 0.5, 0.7 and 0.9) and examined the accuracy and F-Score of SemMT-H with the different values of K against all the threshold values. We can see from Fig. 12 that the general trends of accuracy are similar despite various fluctuations, while the trends of F-Score vary largely. Specifically, for accuracy, choosing a smaller K value in the hybrid metric achieves better performance when the similarity threshold is less than 0.4 or larger than 0.8. For F-Score, a small K value outperforms a large K value when the threshold is less than 0.8, and the reverse is observed for thresholds between 0.8 and 1.0. Furthermore, considering accuracy and F-Score together, the SemMT-H similarity offers better performance over a wide threshold range when K takes a value between 0.3 and 0.5. If the SemMT-H similarity is to be used with a small threshold value where precision takes priority, a small K such as 0.1 is preferable.

Machine translation testing aims at finding sentences that trigger translation errors [36]. Pesu et al. [66] first applied metamorphic testing to machine translation systems. They proposed a Monte Carlo method to avoid round-trip translation by selecting eight target languages for the fixed source language, English. Using factorial design and analysis, they evaluated the translation performance over different combinations of source and target languages. After that, more metamorphic testing approaches were developed. Sun et al. [99] designed straightforward metamorphic relations focusing on short sentences in subject-verb-object structure (e.g., "Mike loves to eat KFC" and "Mouse loves to eat KFC"). They generated test inputs by replacing human names before "likes" or "hates", and brands after them. Wang et al. [91] detected under- and over-translation in the absence of reference translations. By checking the frequency of occurrence and learning mappings between bilingual words and phrases, their work is able to detect these two kinds of mistranslations efficiently and scalably. Later, He et al. [39] and Sun et al. [83] built on the assumption that similar sentences should have similar translation results. To be more specific, SIT [39] generates similar test inputs by substituting one word in a given sentence, such that the generated inputs are semantically similar and syntactically equivalent to the given sentence; suspicious issues are then reported if the structures of the translated sentences differ. Similarly, TransRepair [83] conducts mutation testing via context-similar word replacement, with the intuition that the translations of the original sentence and its mutants should be consistent except for the changed token. PatInv [36], on the other hand, considers pathological invariance: sentences with different meanings should not have the same translation. Following this intuition, it generates syntactically similar but semantically different sentences by either replacing one word with a non-synonymous word using masked language models or removing one word based on its constituency structure, and then flags potential mistranslations based on the textual closeness of the translations. We can see that the existing techniques mainly focus on the textual or syntactic level, while the preservation of semantics during translation has not gained enough attention. SemMT therefore aims at filling this gap by evaluating whether the semantic meaning is preserved during translation, complementing the existing works.

Evaluating on adversarial examples has become a standard procedure to measure the robustness of deep learning models [33]. Adversarial examples are inputs designed to slightly manipulate real-world examples such that a well-trained machine learning model performs poorly on them [30]. These works mainly fall into two categories: black-box and white-box methodologies. Black-box adversarial sample generation assumes no knowledge of the translation system's implementation. Belinkov et al. [10] showed that character-level machine translation systems are overly sensitive to random character manipulations, such as keyboard typos. The work in [30] investigated adversarial examples for both untargeted and targeted attacks on character-level neural machine translation in a white-box manner; it casts the problem as an optimization problem and generates adversarial examples using the gradients of the translation models to inflict more damaging manipulations, aiming for a larger decrease in the BLEU score or other target metrics.

Unlike machine translation testing, quality estimation considers more than correctness: it aims at deriving estimation results similar to those made by humans. Traditionally, it estimates the amount of post-editing effort required to convert a given translation result into the reference translation [78].
In the recent decade, the trend has been to find effective quality estimation metrics that can directly score a translation result without a human-written reference [31]. To alleviate the manual effort of providing reference translations, round-trip translation (RTT) has been proposed [45, 77]. The general idea of RTT is to use the original sentence as the reference, compare the round-trip translated sentence with it, and calculate estimation metrics such as the BLEU score to show the correlation with human judgement. In the early 2010s, Aiken et al. [4] manually assessed the correlation between input sentences and translated outputs in an RTT manner. The result implied that, if a suitable semantic-level metric is provided, RTT-based methods can be reliably used for machine translation evaluation. They also pointed out that RTT quality might reflect the general quality of the machine translation system over the length of a longer text or multiple language pairs. Afterwards, with the emergence of BERT [28] and SBERT [73], semantic similarity at both the word and sentence level can be better captured [14]. Moon et al. [60] then revisited RTT for quality estimation and achieved the highest correlations with human judgments compared with state-of-the-art works, indicating that RTT-based methods can be used to evaluate machine translation systems when semantic similarity is considered. By observing the correlated results, they also illustrated the robustness of RTT-based quality estimation to the choice of backward translation system. This motivates us to adopt RTT and develop three semantic similarity metrics.

The semantic metrics used in SemMT rely on the transformation from natural language to regexes, so the precision of regex synthesis may affect SemMT's performance. If the derived regexes are imprecise, our proposed semantic similarity metrics might not accurately measure the real semantic relationship between the source and translated sentences. To alleviate the influence of imprecise regex transformation, we adopted the state-of-the-art model and trained it on an augmented dataset to minimize the inaccurate predictions caused by out-of-vocabulary words. Our analysis of the false positives reported by SemMT shows that the performance of SemMT can also be influenced by invalid synthesized regexes, such as unpaired or mis-paired brackets (e.g., [a-z{0,3} is invalid due to the missing square bracket) or incorrect combinations of operators (e.g., +{0,3}). An invalid regex cannot be transformed into a DFA, making the similarity computation over DFAs infeasible. To mitigate this problem, we post-processed the transformed regexes, pairing or repairing the unpaired/mis-paired brackets.
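The post-processing just described is essentially a validity check plus a repair pass. The sketch below shows one possible, simplified version in Python: validate by attempting to compile the regex, then naively append or insert the missing closing brackets. It illustrates the idea only; the repair heuristic is an assumption and is not the exact procedure used in SemMT.

    import re

    def is_valid_regex(pattern: str) -> bool:
        # A regex is treated as valid if Python's re module can compile it.
        try:
            re.compile(pattern)
            return True
        except re.error:
            return False

    def repair_brackets(pattern: str) -> str:
        # Naively close unpaired '[' and '{' brackets. This only handles the
        # simple unpaired-bracket case mentioned above (e.g., '[a-z{0,3}'
        # lacking a ']'); incorrect operator combinations would need a more
        # careful repair.
        open_square = pattern.count("[") - pattern.count("]")
        open_curly = pattern.count("{") - pattern.count("}")
        repaired = pattern
        if open_square > 0:
            # Crude heuristic: place the missing ']' before the first '{'.
            idx = repaired.find("{")
            repaired = repaired[:idx] + "]" + repaired[idx:] if idx != -1 else repaired + "]"
        if open_curly > 0:
            repaired += "}" * open_curly
        return repaired

    broken = "[a-z{0,3}"
    print(is_valid_regex(broken))         # False: unterminated character set
    fixed = repair_brackets(broken)
    print(fixed, is_valid_regex(fixed))   # "[a-z]{0,3}" True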
Moreover, precision cannot be ensured if the original sentence presents unclear or ambiguous logic [53, 64, 96]. Even though the round-trip testing we use compares two sentences in the same language, ambiguity in the translation between the two languages in the forward and backward trips can influence our test results. The consequences of ambiguity include mistranslation of sentences and incorrect regex transformation (i.e., the regex is mistakenly transformed, changing the semantic meaning after transformation). Besides, the reliance on published natural-language-to-regex transformation tools limits our evaluation to English datasets. Finally, the English proficiency of the authors may affect the evaluation. To alleviate such impacts, we consulted a linguist to ensure the semantics are preserved during the synonym substitution (during the dataset augmentation and mutation operator construction described in §4.1). In addition, when quantifying vague quantifiers for semantics approximation (§3.3.5 and §5.3), we also discussed with the expert linguist for confirmation.

In this paper, we proposed SemMT, a semantic-based machine translation testing framework. It tests semantic similarity during translation, taking a first step toward semantics-aware testing of translation systems. Specifically, we focused on the semantics of quantifiers and logical relations, which account for a non-trivial proportion of sentences in daily life and whose mistranslation may cause severe consequences. By transforming such sentences into regular expressions, SemMT captures the semantics of sentences during translation and detects suspicious mistranslations by semantic similarity measurements. The evaluation showed that SemMT achieves higher effectiveness than state-of-the-art works, with an increase of 34.2% in accuracy. Furthermore, in terms of unique mistranslation detection, SemMT covers most of the bugs found by existing techniques and locates 67 additional bugs that are missed by them. Our exploration also indicated that further improvement may be achieved when proper combinations of similarity metrics are adopted.

REFERENCES
Semantic relations and their use in elaborating terminology
The Efficacy of Round-trip Translation for MT Evaluation
Logic in Linguistics
Efficient Exact Set-Similarity Joins
OpusFilter: A Configurable Parallel Corpus Filtering Toolbox
Quantification in natural languages
Generalized quantifiers and natural language
Synthetic and Natural Noise Both Break Neural Machine Translation
Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data
Inference of concise regular expressions and DTDs
Vague Quantifiers
SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation
Exploring regular expression usage and context in Python
A Fast and Accurate Dependency Parser using Neural Networks
Multi-modal synthesis of regular expressions
Enhanced LSTM for Natural Language Inference
Metamorphic Testing: A New Approach for Generating Next Test Cases
Metamorphic testing: A review of challenges and opportunities
Exploring Logically Dependent Multi-task Learning with Causal Inference
Finite state languages
Correlating automated and human assessments of machine translation quality
Similarity in languages and programs
Diarrheal diseases
Palestinian man is arrested by police after posting 'Good morning' in Arabic on Facebook which was wrongly translated as 'attack them'
Testing Regex Generalizability And Its Implications: A Large-Scale Many-Language Measurement Study
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Using Structured Events to Predict Stock Price Movement: An Empirical Investigation
On Adversarial Examples for Character-Level Neural Machine Translation
Findings of the WMT 2019 Shared Tasks on Quality Estimation
Database systems: the complete book
Explaining and Harnessing Adversarial Examples
Few, a Few - What's the Difference?
Search Engine Guided Neural Machine Translation
Machine Translation Testing via Pathological Invariance
Logical Inferences with Comparatives and Generalized Quantifiers
Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective
Structure-Invariant Testing for Machine Translation
What is a Quantifier?
Introduction to automata theory, languages, and computation
Response quality in survey research with children and adolescents: the effect of labeled response options and vague quantifiers
What is the difference between "few" and "a few"
Europarl: A parallel corpus for statistical machine translation
Manual and Automatic Evaluation of Machine Translation between European Languages
Noisy Parallel Corpus Filtering through Projected Word Embeddings
Resolution of quantifier scope ambiguities
Using Semantic Unification to Generate Regular Expressions from Natural Language
An intensional parametric semantics for vague quantifiers
Has Machine Translation Achieved Human Parity? A Case for Document-level Evaluation
YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources
Quantification as a major module of natural language semantics (Studies in Discourse Representation Theory and the Theory of Generalized Quantifiers)
Neural Generation of Regular Expressions from Natural Language with Minimal Domain Knowledge
Housing Needs Study
The Greatest Mistranslations Ever
Putting Evaluation in Context: Contextual Embeddings Improve Machine Translation Evaluation
Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics
Modal realism with overlap
Wholes and parts: The limits of composition
Revisiting Round-Trip Translation for Quality Estimation
9 Little Translation Mistakes That Caused Big Problems
Facebook apologizes after wrong translation sees Palestinian man arrested for posting 'good morning'
Bleu: a Method for Automatic Evaluation of Machine Translation
SoftRegex: Generating Regex from Natural Language Descriptions using Softened Regex Equivalence
A Monte Carlo Method for Metamorphic Testing of Machine Translation Services
Quantifiers in language and logic
Some of Them Can be Guessed! Exploring the Effect of Linguistic Context in Predicting Quantifiers
chrF: character n-gram F-score for automatic MT evaluation
Logic for linguists (Course for LSA institute)
Policy Brief: The Impact of COVID-19 on children
A multilingual natural-language interface to regular expressions
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Learning String Edit Distance
A mathematical theory of communication (The Bell System Technical Journal)
A more general theory of definite descriptions (The Philosophical Review)
Evaluation and Usability of Back Translation for Intercultural Communication
A study of translation edit rate with targeted human annotation
Understanding vagueness: logical, philosophical and linguistic perspectives
Australasian Language Technology Association
Zero-shot Learning of Classifiers from Natural Language Quantification
Concepts and semantic relations in information science
Automatic testing and improvement of machine translation
Quantifiers and Cognition - Logical and Computational Perspectives
Parallel Data, Tools and Interfaces in OPUS
The Helsinki submission to the AmericasNLP shared task
Rex: Symbolic Regular Expression Explorer
When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion
How well are regular expressions tested in the wild
Detecting Failures of Neural Machine Translation in the Absence of Reference Translations
Detecting Failures of Neural Machine Translation in the Absence of Reference Translations
How much is 'quite a bit'? Mapping between numerical values and vague quantifiers
Effects of machine translation on collaborative work
Guiding Neural Machine Translation with Retrieved Translation Pieces
Generating Natural Adversarial Examples
SemRegex: A Semantics-Based Approach for Generating Regular Expressions from Natural Language Specifications
Generating Regular Expressions from Natural Language Specifications: Are We There Yet?
Metamorphic Testing for Software Quality Assessment: A Study of Search Engines
Metamorphic Testing for Machine Translations: MT4MT

ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers for their comments and suggestions. We would also like to thank the communication tutor, Mrs. Shauna Dalton, who helped with proofreading and advice on English usage from the linguistic perspective.