Gappy Pattern Matching on GPUs for On-Demand Extraction of Hierarchical Translation Grammars

Hua He
Dept. of Computer Science
University of Maryland
College Park, Maryland
huah@cs.umd.edu

Jimmy Lin
The iSchool and UMIACS
University of Maryland
College Park, Maryland
jimmylin@umd.edu

Adam Lopez
School of Informatics
University of Edinburgh
Edinburgh, United Kingdom
alopez@inf.ed.ac.uk

Abstract

Grammars for machine translation can be materialized on demand by finding source phrases in an indexed parallel corpus and extracting their translations. This approach is limited in practical applications by the computational expense of online lookup and extraction. For phrase-based models, recent work has shown that on-demand grammar extraction can be greatly accelerated by parallelization on general purpose graphics processing units (GPUs), but these algorithms do not work for hierarchical models, which require matching patterns that contain gaps. We address this limitation by presenting a novel GPU algorithm for on-demand hierarchical grammar extraction that is at least an order of magnitude faster than a comparable CPU algorithm when processing large batches of sentences. In terms of end-to-end translation, with decoding on the CPU, we increase throughput by roughly two thirds on a standard MT evaluation dataset. The GPU necessary to achieve these improvements increases the cost of a server by about a third. We believe that GPU-based extraction of hierarchical grammars is an attractive proposition, particularly for MT applications that demand high throughput.

1 Introduction

Most machine translation systems extract a large, fixed translation model from parallel text that is accessed from memory or disk. An alternative is to store the indexed parallel text in memory and extract translation units on demand only when they are needed to decode new input. This architecture has several advantages: It requires only a few gigabytes to represent a model that would otherwise require a terabyte (Lopez, 2008b).
It can adapt incrementally to new training data (Levenberg et al., 2010), making it useful for interactive translation (González-Rubio et al., 2012). It supports rule extraction that is sensitive to the input sentence, enabling leave-one-out training (Simianer et al., 2012) and the use of sentence similarity features (Philips, 2012).

On-demand extraction can be slow, but for phrase-based models, massive parallelization on general purpose graphics processing units (GPUs) can dramatically accelerate performance. He et al. (2013) demonstrated orders of magnitude speedup in exact pattern matching with suffix arrays, the algorithms at the heart of on-demand extraction (Callison-Burch et al., 2005; Zhang and Vogel, 2005). However, some popular translation models use "gappy" phrases (Chiang, 2007; Simard et al., 2005; Galley and Manning, 2010), and the GPU algorithm of He et al. does not work for these models since it is limited to contiguous phrases. Instead, we need pattern matching and phrase extraction that can handle variable-length gaps (Lopez, 2007).

This paper presents a novel GPU algorithm for on-demand extraction of hierarchical translation models based on matching and extracting gappy phrases. Our experiments examine both grammar extraction and end-to-end translation, comparing quality, speed, and memory use. We compare against the GPU system for phrase-based translation by He et al. (2013) and cdec, a state-of-the-art CPU system for hierarchical translation (Dyer et al., 2010). Our system outperforms the former on translation quality by 2.3 BLEU (replicating previously-known results) and outperforms the latter on speed, improving grammar extraction throughput by at least an order of magnitude on large batches of sentences while maintaining the same level of translation quality. Our contribution is to show, complete with an open-source implementation, how GPUs can vastly increase the speed of hierarchical grammar extraction, particularly for high-throughput MT applications.

2 Algorithms

GPU architectures, which are optimized for massively parallel operations on relatively small streams of data, strongly influence the design of our algorithms, so we first briefly review some key properties. Our NVIDIA Tesla K20c GPU (Kepler Generation) provides 2496 thread processors (CUDA cores), and computation proceeds concurrently in groups of 32 threads called warps. Each thread in a warp carries out identical instructions in lockstep. When a branching instruction occurs, only threads meeting the branch condition execute, while the rest idle; this is called warp divergence and is a source of poor performance. Consequently, GPUs are poor at "irregular" computations that involve conditionals, pointer manipulation, and complex execution sequences.

Our pattern matching algorithm is organized around two general design principles: brute force scans and fine-grained parallelism. Brute force array scans avoid warp divergence since they access data in regular patterns. Rather than parallelize larger algorithms that use these scans as subroutines, we parallelize the scans themselves in a fine-grained manner to obtain high throughput. The relatively small size of the GPU memory also affects design decisions.
Data transfer between the GPU and the CPU has high latency, so we want to avoid shuffling data as much as possible. To accomplish this, we must fit all our data structures into the 5 GB memory available on our particular GPU. As we will show, this requires some tradeoffs in addition to careful design of algorithms and associated data structures.

2.1 Translation by Pattern Matching

Lopez (2008b) provides a recipe for "translation by pattern matching" that we use as a guide for the remainder of this paper (Algorithm 1).

Algorithm 1 Translation by pattern matching
1: for each input sentence do
2:    for each phrase in the sentence do
3:       Find its occurrences in the source text
4:       for each occurrence do
5:          Extract any aligned target phrase
6:    for each extracted phrase pair do
7:       Compute feature values
8:    Decode as usual using the scored rules

We encounter a computational bottleneck in lines 2–7, since there are many query phrases, matching occurrences, and extracted phrase pairs to process. Below, we tackle each challenge in turn. To make our discussion concrete, we will use a toy English-Spanish translation example.

At line 3 we search for phrases in the source side of a parallel text. Suppose that it contains two sentences: it makes him and it mars him, and it sets him on and it takes him off. We map each unique word to an integer id, and call the resulting array of integers that encodes the source text under this transformation the text T. Let |T| denote the total number of tokens, T[i] denote its ith element, and T[i]...T[j] denote the substring starting with its ith element and ending with its jth element. Since T encodes multiple sentences, we use special tokens to denote the end of a sentence and the end of the corpus. In our example we use # and $, respectively, as shown in Figure 1.

Now suppose we want to translate the sentence it persuades him and it disheartens him. We encode it under the same mapping as T and call the resulting array the query Q. Our goal is to materialize all of the hierarchical phrases that could be used to translate Q based on training data T. Our algorithm breaks this process into a total of 14 distinct passes, each performing a single type of computation in parallel. Five of these passes are based on the algorithm described by He et al. (2013), and we review them for completeness. The nine new passes are identified as such.

[Figure 1: Example text T (showing words rather than their integer encodings for clarity) and suffix array ST with corresponding suffixes. T = # it makes him and it mars him # it sets him on and it takes him off # $ at positions 1-20; ST = (5, 14, 4, 17, 12, 8, 2, 6, 10, 15, 3, 7, 18, 13, 11, 16, 1, 9, 19, 20) lists the suffix start positions in lexicographic order of the corresponding suffixes.]
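To make the encoding concrete, here is a minimal C sketch of the mapping from words to integer ids, with distinct sentinel values for # and $. It is our own illustration; the function names, the zero/negative sentinel ids, and the fixed-size arrays are invented for the example and are not taken from the released system.

```c
#include <stdio.h>
#include <string.h>

#define MAX_VOCAB  64
#define MAX_TOKENS 64

/* Sentinels: '#' (end of sentence) and '$' (end of corpus). Word ids start at 1. */
enum { END_SENT = 0, END_CORPUS = -1 };

static const char *vocab[MAX_VOCAB];
static int vocab_size = 0;

static int word_id(const char *w) {            /* assign integer ids on first use */
    for (int i = 0; i < vocab_size; i++)
        if (strcmp(vocab[i], w) == 0) return i + 1;
    vocab[vocab_size++] = w;
    return vocab_size;
}

static void encode(const char **toks, int n, int *out) {
    for (int i = 0; i < n; i++) {
        if (strcmp(toks[i], "#") == 0)      out[i] = END_SENT;
        else if (strcmp(toks[i], "$") == 0) out[i] = END_CORPUS;
        else                                out[i] = word_id(toks[i]);
    }
}

int main(void) {
    const char *t[] = { "#", "it", "makes", "him", "and", "it", "mars", "him",
                        "#", "it", "sets", "him", "on", "and", "it", "takes",
                        "him", "off", "#", "$" };
    const char *q[] = { "it", "persuades", "him", "and", "it", "disheartens", "him" };
    int T[MAX_TOKENS], Q[MAX_TOKENS];
    encode(t, 20, T);   /* the text T of Figure 1 */
    encode(q, 7, Q);    /* the query Q; unseen words get fresh ids in this sketch */
    printf("'it' has id %d; it appears at T[2] and Q[1] in the paper's 1-based indexing\n", T[1]);
    return 0;
}
```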
2.2 Finding Every Phrase

Line 3 of Algorithm 1 searches T for all occurrences of all phrases in Q. We call each phrase a pattern, and our goal is to find all phrases of T that match each pattern, i.e., the problem of pattern matching. Our translation model permits phrases with at most two gaps, so Q is a source of O(|Q|^6) patterns, since there are up to six possible subphrase boundaries.

Passes 1-2: Finding contiguous patterns
To find contiguous phrases (patterns without gaps), we use the algorithm of He et al. (2013). It requires a suffix array (Manber and Myers, 1990) computed from T. The ith suffix of a 1-indexed text T is the substring T[i]...T[|T|] starting at position i and continuing to the end of T. The suffix array ST is a permutation of the integers 1, ..., |T| ordered by a lexicographic sort of the corresponding suffixes (Figure 1). Given ST, finding a pattern P is simply a matter of binary search for the pair of integers (ℓ, h) such that for all i from ℓ to h, P is a prefix of the ST[i]th suffix of T. Thus, each integer ST[i] identifies a unique match of P. In our example, the pattern it returns (7, 10), corresponding to matches at positions 2, 6, 10, and 15; while him and it returns (3, 3), corresponding to a match at position 4. A longest common prefix (LCP) array enables us to find h or ℓ in O(|Q| + log |T|) comparisons (Manber and Myers, 1990).

Every substring of Q is a contiguous pattern, but if we searched T for all of them, most searches would fail, wasting computation. Instead, He et al. (2013) use two passes. The first computes, concurrently for every position i in 1, ..., |Q|, the endpoint j of the longest substring Q[i]...Q[j] that appears in T. It also computes the suffix array range of the one-word substring Q[i]. Taking this range as input, for all k from i to j the second pass concurrently queries T for pattern Q[i]...Q[k]. This pass uses two concurrent threads per pattern: one to find the lowest index of the suffix array range, and one to find the highest index.

Passes 3-4: Finding one-gap patterns (New)
Passes 1 and 2 find contiguous phrases, but we must also find phrases that contain gaps. We use the special symbol ? to denote a variable-length gap. The set of one-gap patterns in Q thus consists of Q[i]...Q[j] ? Q[i′]...Q[j′] for all i, j, i′, and j′ such that i ≤ j < i′ − 1 and i′ ≤ j′. When the position in Q is not important we use strings u, v, and w to denote contiguous patterns; for example, u ? v denotes an arbitrary one-gap pattern. We call the contiguous strings u and v of u ? v its subpatterns, e.g., it ? him is a pattern with subpatterns it and him. When we search for a gappy pattern, ? can match any non-empty substring of T that does not contain $ or #. Such a match may not be uniquely identified by the index of its first word, so we specify it with a tuple of indices, one for the match of each subpattern. Pattern it ? him has six matches in T: (2, 4), (2, 8), (6, 8), (10, 12), (10, 17), and (15, 17).

Passes 3 and 4 search T for all one-gap patterns using the novel GPU algorithm described below. A pattern u ? v cannot match in T unless both u and v match in T, so we use the output of pass 1, which returns all (i, j) pairs such that Q[i]...Q[j] matches in T. Concurrently for every such i and j, pass 3 enumerates all i′ and j′ such that j < i′ − 1 and Q[i′]...Q[j′] matches in T, returning each pattern Q[i]...Q[j] ? Q[i′]...Q[j′]. Pass 3 then sorts and deduplicates the combined results of all threads to obtain a set of unique patterns.
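To ground passes 1-4, the following single-threaded C sketch runs the suffix-array lookup used for contiguous patterns and then the within-sentence scan that pass 4 (described below) performs for each match of a subpattern. It is our own illustration on the toy text of Figure 1: the suffix array is built with a naive qsort rather than the Manber and Myers (1990) construction, the LCP-accelerated search is replaced by a plain binary search, and the maximum-phrase-length bound is omitted.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy text T of Figure 1 as integer ids, using the same ids as the encoding
 * sketch above: 1=it 2=makes 3=him 4=and 5=mars 6=sets 7=on 8=takes 9=off,
 * 0='#', -1='$'. Indices here are 0-based; the paper's are 1-based. */
static const int T[] = { 0,1,2,3,4,1,5,3, 0,1,6,3,7,4,1,8,3,9, 0,-1 };
static const int NT  = sizeof(T) / sizeof(T[0]);

static int suf_cmp(const void *a, const void *b) {      /* lexicographic suffix order */
    int i = *(const int *)a, j = *(const int *)b;
    while (i < NT && j < NT && T[i] == T[j]) { i++; j++; }
    if (i == NT) return -1;
    if (j == NT) return 1;
    return T[i] - T[j];
}

/* Compare pattern P (length m) against the length-m prefix of the suffix at s. */
static int pref_cmp(const int *P, int m, int s) {
    for (int k = 0; k < m; k++) {
        if (s + k >= NT) return 1;
        if (P[k] != T[s + k]) return P[k] - T[s + k];
    }
    return 0;                                            /* P is a prefix: a match */
}

/* Binary search for the suffix-array range [lo, hi) of suffixes prefixed by P. */
static void sa_range(const int *sa, const int *P, int m, int *lo, int *hi) {
    int l = 0, r = NT;
    while (l < r) { int mid = (l + r) / 2;
        if (pref_cmp(P, m, sa[mid]) > 0) l = mid + 1; else r = mid; }
    *lo = l; r = NT;
    while (l < r) { int mid = (l + r) / 2;
        if (pref_cmp(P, m, sa[mid]) >= 0) l = mid + 1; else r = mid; }
    *hi = l;
}

int main(void) {
    int sa[sizeof(T) / sizeof(T[0])];
    for (int i = 0; i < NT; i++) sa[i] = i;
    qsort(sa, NT, sizeof(int), suf_cmp);                 /* naive suffix array */

    int u[] = { 1 }, v[] = { 3 };                        /* u = "it", v = "him" */
    int ulo, uhi, vlo, vhi;
    sa_range(sa, u, 1, &ulo, &uhi);
    sa_range(sa, v, 1, &vlo, &vhi);
    printf("'it' matches %d times, 'him' matches %d times\n", uhi - ulo, vhi - vlo);

    /* Pass-4 style scan: from each match of u, scan ahead within the sentence
     * for matches of v, yielding matches (i, j) of the gappy pattern u ? v.
     * On the GPU, each match of the rarer subpattern gets its own thread. */
    for (int k = ulo; k < uhi; k++) {
        int i = sa[k];
        for (int j = i + 2; j < NT && T[j] != 0 && T[j] != -1; j++)
            if (T[j] == v[0]) printf("it ? him matches at (%d, %d)\n", i, j);
    }
    return 0;
}
```

On the toy data this prints the six matches of it ? him noted above (in 0-based positions); the inner scan is exactly the per-thread work that the GPU kernel parallelizes.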
The sort and deduplication operations of pass 3 are carried out on the GPU using the algorithms of Hoberock and Bell (2010).

Pass 4 searches for matches of each pattern identified by pass 3. We first illustrate with it ? him. Pass 2 associates it with suffix array range (7, 10). A linear scan of ST in this range reveals that it matches at positions 2, 6, 10, and 15 in T. Likewise, him maps to range (3, 6) of ST and matches at 4, 17, 12, and 8 in T. Concurrently for each match of the less frequent subpattern, we scan T to find matches of the other subpattern until reaching a sentence boundary or the maximum phrase length, an idea we borrow from the CPU implementation of Baltescu and Blunsom (2014). In our example, both it and him occur an equal number of times, so we arbitrarily choose one; suppose we choose it. We assign each of positions 2, 6, 10, and 15 to a separate thread. The thread assigned position 2 scans T for matches of him until the end of sentence at position 9, finding matches (2, 4) and (2, 8). As a second example, consider it ? and. In this case, it has four matches, but and only two. So, we need only two threads, each scanning backwards from matches of and. Since most patterns are infrequent, allocating threads this way minimizes work. However, very large corpora contain one-gap patterns for which both subpatterns are frequent. We simply precompute all matches for these patterns and retrieve them at runtime, as in Lopez (2007). This precomputation is performed once given T and is therefore a one-time cost.

Materializing every match of u ? v would consume substantial memory, so we only emit those for which a translation of the substring matching ? is extractable using the check in §2.3. The success of this check is a prerequisite for extracting the translation of u ? v or any pattern containing it, so pruning in this way conserves GPU memory without affecting the final grammar.

Passes 5-7: Finding two-gap patterns (New)
We next find all patterns with two gaps of the form u ? v ? w. The search is similar to passes 3 and 4. In pass 5, concurrently for every pattern u ? v matched in pass 4, we enumerate the pattern u ? v ? w for every w such that u ? v ? w is a pattern in Q and w matched in pass 1. In pass 6, concurrently for every match (i, j) of u ? v for every u ? v ? w enumerated in pass 5, we scan T from position j + |v| + 1 for matches of w until we reach the end of sentence. As with the one-gap patterns, we apply the extraction check on the second ? of the two-gap patterns u ? v ? w to avoid needlessly materializing matches that will not yield translations.

2.3 Extracting Every Target Phrase

In line 5 of Algorithm 1 we must extract the aligned translation of every match of every pattern found in T. Efficiency is crucial since some patterns may occur hundreds of thousands of times. We extract translations from word alignments using the consistency check of Och et al. (1999). A pair of substrings is consistent only if no word in either substring is aligned to any word outside the pair. For example, in Figure 2 the pair (it sets him on, los excita) is consistent. The pair (him on and, los excita y) is not, because excita also aligns to the words it sets. Only consistent pairs can be translations of each other in our model.

[Figure 2: An example alignment and the corresponding L, R, P, L′ and R′ arrays for the source sentence it sets him on and it takes him off (T[10]...T[18]) and its aligned target sentence los excita y los paraliza (T′[8]...T′[12]).]
Given a specific source substring, our algorithm asks: is it part of a consistent pair? To answer this question, we first compute the minimum target substring to which all words in the substring align. We then compute the minimum substring to which all words of this candidate translation align. If this substring matches the input, the candidate translation is returned; otherwise, extraction fails. For example, it sets him on in range (10, 13) aligns to los excita in range (8, 9), which aligns back to (10, 13). So this is a consistent pair. However, him on and in range (12, 14) aligns to los excita y in range (8, 10), which aligns back to it sets him on and at (10, 14). So him on and is not part of a consistent pair and has no extractable translation.

To extract gappy translation units, we subtract consistent pairs from other consistent pairs (Chiang, 2007). For example, (him, los) is a consistent pair. Subtracting it from (it sets him on, los excita) yields the translation unit (it sets ? on, ? excita).

Our basic building block is the EXTRACT function of He et al. (2013), which performs the above check using byte arrays denoted L, R, P, L′, and R′ (Figure 2) to identify extractable target phrases in target text T′. When T[i] is a word, L[i] and R[i] store the sentence-relative positions of the leftmost and rightmost words it aligns to in T′, and P[i] stores T[i]'s sentence-relative position. When T[i] is a sentence boundary, the concatenation of bytes L[i], R[i], and P[i], denoted LRP[i], stores the position of the corresponding sentence in T′. Bytes L′[i′] and R′[i′] store the sentence-relative positions of the leftmost and rightmost words T′[i′] aligns to.

We first calculate the start position p of the source sentence containing T[i]...T[j], and the start position p′ of the corresponding target sentence:

p = i − P[i]
p′ = LRP[p]

We then find target indices i′ and j′ for the candidate translation T′[i′]...T′[j′]:

i′ = p′ + min_{k ∈ {i, ..., j}} L[k]
j′ = p′ + max_{k ∈ {i, ..., j}} R[k]

We can similarly find the translation T[i′′]...T[j′′] of T′[i′]...T′[j′]:

i′′ = p + min_{k′ ∈ {i′, ..., j′}} L′[k′]
j′′ = p + max_{k′ ∈ {i′, ..., j′}} R′[k′]

If i = i′′ and j = j′′, EXTRACT(i, j) returns (i′, j′), the position of T[i]...T[j]'s translation. Otherwise the function signals that there is no extractable translation. Given this function, extraction proceeds in three passes.

Pass 8: Extracting contiguous patterns
Each match of pattern u is assigned to a concurrent thread. The thread receiving the match at position i returns the pair consisting of u and its translation according to EXTRACT(i, i + |u| − 1), if any. It also returns translations for patterns in which u is the only contiguous subpattern: ? u, u ?, and ? u ?. We extract translations for these patterns even if u has no translation itself. To see why, suppose that we reverse the translation direction of our example. In Figure 2, excita is not part of a consistent pair, but both ? excita and ? excita ? are.

Consider ? u. Since ? matches any substring in T without boundary symbols, the leftmost position of ? u is not fixed. So, we seek the smallest match with an extractable translation, returning its translation with the following algorithm.

1: k ← i
2: while T[k − 1] ≠ # do
3:    k ← k − 1
4:    if EXTRACT(k, i + |u| − 1) succeeds then
5:       (i′, j′) ← EXTRACT(k, i + |u| − 1)
6:       if EXTRACT(k, i − 1) succeeds then
7:          (p′, q′) ← EXTRACT(k, i − 1)
8:          return T′[i′]...T′[p′] ? T′[q′]...T′[j′]
9: if T[k − 1] = # then
10:   return failure

The case of u ? is symmetric.
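As a concrete illustration of the EXTRACT check itself, the following C sketch hard-codes the Figure 2 arrays (indexed by the paper's 1-based positions) and applies the computation above to the two worked examples. It is our own illustration: the array sizes and the driver code in main are invented for the example, while the released system stores L, R, P, LRP, L′ and R′ as byte arrays resident in GPU memory.

```c
#include <stdio.h>

/* Toy data from Figure 2. Source sentence T[10..18] = it sets him on and it
 * takes him off; target sentence T'[8..12] = los excita y los paraliza.
 * LRP[9] = 7 is the target-side position of the source sentence boundary. */
#define N 32
static int L[N], R[N], P[N], LRP[N], Lp[N], Rp[N];   /* Lp, Rp play the role of L', R' */

static int extract(int i, int j, int *ti, int *tj) {
    int p  = i - P[i];                   /* start ('#') of the source sentence  */
    int pp = LRP[p];                     /* start of the corresponding target sentence */
    int lo = 127, hi = -1;
    for (int k = i; k <= j; k++) {       /* minimal target span covering i..j   */
        if (L[k] < lo) lo = L[k];
        if (R[k] > hi) hi = R[k];
    }
    int ip = pp + lo, jp = pp + hi;
    lo = 127; hi = -1;
    for (int k = ip; k <= jp; k++) {     /* project that target span back to the source */
        if (Lp[k] < lo) lo = Lp[k];
        if (Rp[k] > hi) hi = Rp[k];
    }
    if (p + lo != i || p + hi != j) return 0;   /* not a consistent pair */
    *ti = ip; *tj = jp;
    return 1;
}

int main(void) {
    /* Fill in the Figure 2 byte arrays (positions not listed are unused). */
    int Ls[] = {2,2,1,2,3,5,5,4,5}, Rs[] = {2,2,1,2,3,5,5,4,5}, Ps[] = {1,2,3,4,5,6,7,8,9};
    for (int k = 0; k < 9; k++) { L[10 + k] = Ls[k]; R[10 + k] = Rs[k]; P[10 + k] = Ps[k]; }
    LRP[9] = 7;
    int Lps[] = {3,1,5,8,6}, Rps[] = {3,4,5,8,9};
    for (int k = 0; k < 5; k++) { Lp[8 + k] = Lps[k]; Rp[8 + k] = Rps[k]; }

    int ti, tj;
    if (extract(10, 13, &ti, &tj))       /* "it sets him on"                    */
        printf("(10,13) -> target (%d,%d)\n", ti, tj);   /* (8,9) = los excita  */
    if (!extract(12, 14, &ti, &tj))      /* "him on and"                        */
        printf("(12,14) has no extractable translation\n");
    return 0;
}
```

In the GPU implementation, one concurrent thread applies this check to each match.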
We extend the ? u algorithm to handle ? u ?. The extension considers increasingly distant pairs (k, ℓ) for which ? u ? matches T[k]...T[ℓ], until it either finds an extractable translation, or it encounters both sentence boundaries and fails.

Pass 9: Extracting one-gap patterns (New)
In this pass, each match of pattern u ? v is assigned to a thread. The thread receiving match (i, j) attempts to assign i′, j′, p′ and q′ as follows.

(i′, j′) = EXTRACT(i, j + |v| − 1)
(p′, q′) = EXTRACT(i + |u|, j − 1)

If both calls succeed, we subtract to obtain the translation T′[i′]...T′[p′] ? T′[q′]...T′[j′]. We extract translations for ? u ? v and u ? v ?, using the same algorithm as in pass 8.

Pass 10: Extracting two-gap patterns (New)
In this pass, we extract a translation for each match of u ? v ? w concurrently, using a straightforward extension of the subtraction algorithm in pass 9. Since we only need patterns with up to two gaps, we do not consider other patterns.

To optimize GPU performance, for the passes above, we assign all matches of a gappy pattern to the same thread block. This allows threads in the same thread block to share the same data during initialization, thereby improving memory access coherence.

2.4 Computing Every Feature

In line 7 of Algorithm 1 we compute features of every extracted phrase pair. We use α and β to denote arbitrary strings of words and ? symbols, and our input is a multiset of (α, β) pairs collected by passes 8-10, which we denote by Π. We compute the following features for each unique (α, β) pair.

Log-count features. We need two aggregate statistics: the count of (α, β) in Π, and the count of all pairs in Π for which α is the source pattern. We then compute the two features as log(1 + count(α, β)) and log(1 + count(α)).

Translation log-probability. Given the aggregate counts above, this feature is log(count(α, β) / count(α)).

Singleton indicators. We compute two features, to indicate whether (α, β) occurs only once, i.e., count(α, β) = 1, and whether α occurs only once, i.e., count(α) = 1.

Lexical weight. Consider word pairs a, b with a ∈ α, b ∈ β, and neither a nor b is ?. Given a global word translation probability table p(a|b), which is externally computed from the word alignments directly, the feature is Σ_{a∈α} max_{b∈β} log p(a|b).

Since Π is the result of parallel computation, we must sort it. We can then compute aggregate statistics by keeping running totals in a scan of the sorted multiset. With many instantiated patterns, we would quickly exhaust GPU memory, so this sort is performed on the CPU. We compute the log-count features, translation log-probability, and singleton indicators this way. However, the lexical weight feature is a function only of the aligned translation pair itself and corresponding word-level translation probabilities calculated externally from the word alignment. Thus, the computation of this feature can be parallelized on the GPU. So, we have multiple feature extraction passes based on the number of gaps:

Pass 11 (CPU): One-gap features (New).
Pass 12 (CPU): Two-gap features (New).
Pass 13 (CPU): Contiguous features.
Pass 14 (GPU): Lexical weight feature (New).

2.5 Sampling

In serial implementations of on-demand extraction, very frequent patterns are a major computational bottleneck (Callison-Burch et al., 2005; Lopez, 2008b). Thus, for patterns occurring more than n times, for some fixed n (typically between 100 and 1000), these implementations deterministically sample n matches, and only extract translations of these matches.
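As an illustration of deterministic sampling, the sketch below takes at most n evenly spaced matches from a suffix-array range. This is our own example of one common scheme; the exact sampling rule used by cdec and by our implementation may differ in detail, and the point is only that the same range always yields the same sample.

```c
#include <stdio.h>

/* Deterministically keep at most n of the matches in suffix-array range [lo, hi)
 * by striding through the range at a fixed step. */
static int sample_matches(const int *sa, int lo, int hi, int n, int *out) {
    int total = hi - lo;
    if (total <= n) {                        /* few matches: keep them all */
        for (int k = 0; k < total; k++) out[k] = sa[lo + k];
        return total;
    }
    for (int k = 0; k < n; k++)              /* otherwise take n evenly spaced ones */
        out[k] = sa[lo + (int)((long)k * total / n)];
    return n;
}

int main(void) {
    int sa[] = { 3, 11, 19, 27, 35, 43, 51, 59, 67, 75 };  /* toy match positions */
    int out[4];
    int m = sample_matches(sa, 0, 10, 4, out);
    for (int k = 0; k < m; k++) printf("%d ", out[k]);
    printf("\n");                            /* prints 3 19 43 59 */
    return 0;
}
```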
To compare with these serial implementations, we also implement an optional sampling step. Though we use the same sampling rate, the samples themselves are not the same, since our extraction checks in passes 4 and 6 alter the set of matches that are actually enumerated, and thus sampled from. The CPU algorithms do not use this check.

3 Experimental Setup

We tested our algorithms in an end-to-end Chinese-English translation task using data conditions similar to those of Lopez (2008b) and He et al. (2013). Our implementation of hierarchical grammar extraction on the GPU, as detailed in the previous section, is written in C, using CUDA library v5.5 and GCC v4.8, compiled with the -O3 optimization flag. Our code is open source and available for researchers to download and try out.1

1 http://hohocode.github.io/cgx/

Hardware. We used NVIDIA's Tesla K20c GPU (Kepler Generation), which has 2496 CUDA cores and 5 GB memory, with a peak memory bandwidth of 208 GB/s. The server hosting the GPU has two Intel Xeon E5-2690 CPUs, each with eight cores at 2.90 GHz (a total of 16 physical cores; 32 logical cores with hyperthreading). Both were released in 2012 and represent comparable-generation hardware technology. All GPU and CPU experiments were conducted on the same machine, which runs Red Hat Enterprise Linux (RHEL) 6.

Training Data. We used two training sets: the first consists of news articles from the Xinhua Agency, with 27 million words of Chinese (around one million sentences). The second adds parallel text from the United Nations, with 81 million words of Chinese (around four million sentences).

Test Data. For performance evaluations, we ran tests on sentence batches of varying sizes: 100, 500, 1k, 2k, 4k, 6k, 8k, 16k and 32k. These sentences are drawn from the NIST 2002–2008 MT evaluations (on average 27 words each) and then the Chinese side of the Hong Kong Parallel Text (LDC2004T08) when the NIST data are smaller than the target batch size. Large batch sizes are necessary to saturate the processing power of the GPU. The size of the complete batch of 32k test sentences is 4892 KB.

Baselines. We compared our GPU implementation for on-demand extraction of hierarchical grammars against the corresponding CPU implementation (Lopez, 2008a) found in pycdec (Chahuneau et al., 2012), an extension of cdec (Dyer et al., 2010).2 We also compared our GPU algorithms against Moses (Koehn et al., 2007), representing a standard phrase-based SMT baseline. Phrase tables generated by Moses are essentially the same as those produced by the GPU implementation of on-demand extraction for phrase-based translation by He et al. (2013).

4 Results

4.1 Translation quality

We first verified that our GPU implementation achieves the same translation quality as the corresponding CPU baseline. This is accomplished by comparing system output against the baseline systems, training on Xinhua, tuning on NIST03, and testing on NIST05. In all cases, we used MIRA (Chiang, 2012) to tune parameters. We ran experiments three times and report the average, as recommended by Clark et al. (2011). Hierarchical grammars were extracted with sampling at a rate of 300; we also bound source patterns at a length of 5 and matches at a length of 15. For Moses we used default parameters. Our BLEU scores, shown in Table 1, replicate well-known results where hierarchical models outperform pure phrase-based models on this task.
The difference in quality is partly because the phrase-based baseline system does not use lexicalized reordering, which provides improvements similar to those of hierarchical translation (Lopez, 2008b). Such lexicalized reordering models cannot be produced by the GPU-based system of He et al. (2013). This establishes a clear translation quality improvement of our work over that of He et al. (2013).

2 The Chahuneau et al. (2012) implementation is in Cython, a language for building Python applications with performance-critical components in C. All of the pattern matching code that we instrumented for these experiments is compiled to C/C++. The implementation is a port of the original code written by Lopez (2008a) in Pyrex, a precursor to Cython. Much of the code is unchanged.

System                                       BLEU
Moses phrase-based baseline                  31.11
Hierarchical with online CPU extraction      33.37
Hierarchical with online GPU extraction      33.46

Table 1: Comparison of translation quality. The hierarchical system is cdec. Online CPU extraction is the baseline, part of the standard cdec package. Online GPU extraction is this work.

We see that the BLEU score obtained by our GPU implementation of hierarchical grammar extraction is nearly identical to cdec's, evidence that our implementation is correct. The minor differences in score are due to non-determinism in tuning and the difference in sampling algorithms (§2.5).

4.2 Extraction Speed

Next, we focus on the performance of the hierarchical grammar extraction component, comparing the CPU and GPU implementations. For both implementations, our timings include preparation of queries, pattern matching, extraction, and feature computation. For the GPU implementation, we include the time required to move data to and from the GPU. We do not include time for construction of static data structures (suffix arrays and indexes) and initial loading of the parallel corpus with alignment data, as those represent one-time costs. Note that the CPU implementation includes indexes for frequent patterns in the form u ? v and u ? v ? w, while our GPU implementation indexes only the former. We compared performance varying the number of queries, and following Lopez (2008a), we compared sampling at a rate of 300 against runs without sampling. Our primary evaluation metric is throughput: the average number of processed words per second (i.e., batch size in words divided by total time).

We first establish throughput baselines on the CPU, shown in Table 2. Experiments used different numbers of threads under different data conditions (Xinhua or Xinhua + UN), with and without sampling. Our server has a total of 16 physical cores, but supports 32 logical cores via hyperthreading. We obtained the CPU sampling results by running cdec over 16k query sentences. For the non-sampling runs, since the throughput is so low, we measured performance over 2.6k sentences for Xinhua and 500 sentences for Xinhua + UN.

             +Sampling          −Sampling
threads      X        X+U       X        X+U
1            12.7     4.8       0.32     0.05
16           190.7    65.3      4.81     0.70
32           248.1    76.0      6.23     0.89

Table 2: CPU extraction performance (throughput in words/second) using different numbers of threads under different data conditions (X: Xinhua, X+U: Xinhua+UN), with and without sampling.

We see that throughput scaling is slightly less than linear: with sampling, using 16 threads increases throughput by 15× on the Xinhua data (compared to a single thread) and 13.6× on Xinhua + UN data.
Going from 16 to 32 threads further increases throughput by 15%-30% and saturates the processing capacity of our server. The 32-thread condition provides a fair baseline for comparing the performance of the GPU implementation.

Table 3 shows GPU hierarchical grammar extraction performance in terms of throughput (words/second); these results are averaged over three trials. We varied the number of query sentences, and in each case, also report the speedup with respect to the CPU condition with 32 threads. GPU throughput increases with larger batch sizes because we are increasingly able to saturate the GPU and take full advantage of the massive parallelism it offers. We do not observe this effect on the CPU since we saturate the processors early.

Batch size (sentences)           100    500    1k     2k     4k      6k      8k      16k     32k
Number of tokens                 2.8k   14.5k  28.8k  57.9k  117.9k  161.9k  214.2k  436.5k  893.9k

+Sampling
Xinhua     Throughput (words/s)  236    667    914    1356   1613    1794    2001    2793    3998
           Speedup               1.0×   2.7×   3.7×   5.5×   6.5×    7.2×    8.1×    11.3×   16.1×
Xinhua+UN  Throughput (words/s)  106    223    287    400    454     514     571     709     1016
           Speedup               1.4×   2.9×   3.8×   5.3×   6.0×    6.8×    7.5×    9.3×    13.4×

−Sampling
Xinhua     Throughput (words/s)  84     280    405    690    929     1200    1414    2135    3240
           Speedup               13×    45×    65×    111×   149×    193×    227×    343×    520×
Xinhua+UN  Throughput (words/s)  19     62     99     172    248     304     357     509     793
           Speedup               22×    69×    112×   193×   279×    342×    401×    572×    891×

Table 3: GPU grammar extraction throughput (words/second) under different batch sizes, data conditions, with and without sampling. Speedup is computed with respect to the CPU baseline running on 32 threads.

With a batch size of 100 sentences, the GPU is slightly slower than the 32-thread CPU implementation on Xinhua and faster on Xinhua + UN, both with sampling. Without sampling, the GPU is already an order of magnitude faster at a batch size of 100 sentences. At a batch size of 500 sentences, the GPU implementation is substantially faster than the 32-thread CPU version across all conditions. With the largest batch size in our experiments of 32k sentences, the GPU is over an order of magnitude faster than the fully-saturated CPU with sampling, and over two orders of magnitude faster without sampling. Although previous work does not show decreased translation quality due to sampling (Callison-Burch et al., 2005; Lopez, 2008b), these results illustrate the raw computational potential of GPUs, showing that we can eliminate heuristics that make CPU processing tractable. We believe future work can exploit these untapped processing cycles to improve translation quality.

                    +Sampling        −Sampling
                    X       X+U      X       X+U
GPU one-by-one      7.23    4.7      2.39    0.71
CPU single-thread   12.7    4.8      0.32    0.05

Table 4: Sentence-by-sentence GPU grammar extraction throughput (words/second) vs. a single thread on the CPU (X: Xinhua, X+U: Xinhua + UN).

How does the GPU fare for translation tasks that demand low latency, such as sentence-by-sentence translation on the web? To find out, we conducted experiments where the sentences are fed, one by one, to the GPU grammar extraction algorithm. Results are shown in Table 4, with a comparison to a single-threaded CPU baseline. To be consistent with the other results, we also measure speed in terms of throughput here. Note that we are not aware of any freely available multi-threaded CPU algorithm to process an individual sentence in parallel, so the single-thread CPU comparison is reasonable.3 We observe that the GPU is slower only with sampling on the smaller Xinhua data.
In all other cases, sentence-by-sentence processing on the GPU achieves a similar level of performance or is faster.

3 Parallel sub-sentential parsing has been known for many years (Chandwani et al., 1992), although we don't know of an implementation in any major open-source MT systems.

Next, we compare the performance of hierarchical grammar extraction to phrase-based extraction on the GPU with sampling. We replicated the test data condition of He et al. (2013) so that our first 8k query sentences are the same as those used in their experiments. The results are shown in Table 5, where we report grammar extraction time for batches of different sizes; the bottom row shows the slowdown of the hierarchical vs. non-hierarchical grammar conditions. This quantifies the performance penalty to achieve the translation quality gains reported in Table 1. Hierarchical grammar extraction is about three times slower, primarily due to the computational costs of the new passes presented in §2.

Queries       2k      4k      6k      8k
GPU PBMT      15s     25s     29s     33s
GPU Hiero     43s     73s     90s     107s
Slowdown      2.8×    2.9×    3.1×    3.2×

Table 5: Grammar extraction time comparing this work (GPU Hiero) and the work of He et al. (2013) (GPU PBMT).

Another aspect of performance is memory footprint. We report the memory use (CPU RAM) of all four conditions in Table 6. The values reported for the CPU implementation use a single thread only. At runtime, our hierarchical GPU system exhibits peak CPU memory use of 8 GB on the host machine. Most of this memory is consumed by batching extracted phrases before scoring in passes 10 through 14. Since the phrase-based GPU implementation processes far fewer phrases, the memory footprint is much smaller. The CPU implementations process extracted phrases in small batches grouped by source phrase, and thus exhibit less memory usage. However, these levels of memory consumption are modest considering modern hardware. In all other respects, memory usage is similar for all systems, since the suffix array and associated data structures are all linear in the size of the indexed parallel text.

          CPU        GPU
PBMT      1.6 GB     2.3 GB
Hiero     1.9 GB     8.0 GB

Table 6: Memory consumption (CPU RAM) for different experimental conditions.

4.3 Per-Pass Speed

To obtain a detailed picture of where the GPU speedups and bottlenecks are, we collected per-pass timing statistics. Table 7 shows results for grammar extraction on 6k queries using the Xinhua data with no sampling and default length constraints (Table 7 marks which passes occur on the GPU; all others run on the CPU). These numbers explain the decreased speed of hierarchical extraction compared to He et al. (2013), with the new passes (marked with * in the table) accounting for more than 75% of the total computation time. However, even passes that are nominally the same actually require more time in the hierarchical case: in extracting and scoring phrases associated with a contiguous pattern u, we must now also extract and score patterns ? u, u ?, and ? u ?.

Interestingly, the CPU portions of our algorithm account for around half of the total grammar extraction time. One way to interpret this observation is that the massive parallelization provided by the GPU is so effective that we are bottlenecked by the CPU. In our current design, the CPU portions are those that cannot be easily parallelized on the GPU or those that require too much memory to fit on the GPU.
The former is a possible target for optimization in future work, though the latter will likely be solved by hardware advances alone: for example, the Tesla K80 has 24 GB of memory.

Pass                                  Time (s)    %
  Contig. pattern pass 1 (GPU)          0.03     0.02%
  Contig. pattern pass 2 (GPU)          0.02     0.02%
* One-gap pattern generation (GPU)      1.14     0.86%
* One-gap pattern matching (GPU)       19.24    14.54%
* Two-gap pattern generation (GPU)      0.37     0.28%
* Two-gap pattern matching (GPU)       15.91    12.02%
* Gappy pattern processing (GPU)        1.21     0.91%
  Contig. pattern extraction (GPU)     13.94    10.53%
* Two-gap pattern extraction (GPU)      0.52     0.39%
* One-gap pattern extraction (GPU)     11.07     8.36%
* One gap translation features         23.72    17.91%
* Two gap translation features         30.90    23.34%
  Contig. translation features          5.00     3.77%
* Lexical weight feature (GPU)          3.09     2.33%
  Data transfer and control             6.23     4.71%
  Total                               132.39    100.0%

Table 7: Detailed timings (in seconds) for 6k queries. Passes marked (GPU) occur on the GPU; all others run on the CPU. Passes marked with * are needed for hierarchical grammars and are not present in He et al. (2013).

4.4 End-to-end Speed

What is the benefit of using GPUs in an end-to-end translation task? Since we have shown that both the CPU and GPU implementations achieve near-identical translation quality, the difference lies in speed. But speed is difficult to measure fairly: translation involves not only grammar extraction but also decoding and associated I/O. We have focused on grammar extraction on the GPU, but our research vision involves eventually moving all components of the machine translation pipeline onto the GPU. The experiments we describe below capture the performance advantages of our implementation that are achievable today, using cdec for decoding (on the CPU, using 32 threads).

To measure end-to-end translation speed using pycdec, per-sentence grammars are first extracted for a batch of sentences and written to disk, then read from disk during decoding. Therefore, we report times separately for grammar extraction, disk I/O, and decoding (which includes time for reading the grammar files from disk back into memory). Grammar extraction is either performed on the GPU or on the CPU (using 32 threads), the same as in the experiments described in the previous section.

Grammar extraction    Disk I/O    Decoding
GPU: 30.8s            13.4s       CPU: 59s
CPU: 101.1s

Table 8: Running times for an end-to-end translation pipeline over NIST03 test data. Grammar extraction is either performed on the GPU or the CPU (32 threads); other stages are the same for both conditions (decoding uses 32 threads).

Results are shown in Table 8 using the Xinhua training data and NIST03 test data (919 sentences, 27,045 words). All experiment settings are exactly the same as in the previous section. We observe an end-to-end translation throughput of 262 words/second with GPU grammar extraction (27,045 words in 30.8 + 13.4 + 59 = 103.2 seconds) and 156 words/second on the CPU (32 threads), for a speedup of 1.68×. Despite the speedup, we note that this experiment favors the CPU for several reasons. First, the GPU is idle during decoding, but it could be used to process grammars for a subsequent batch of sentences in a pipelined fashion. Second, NIST03 is a small batch that doesn't fully saturate the GPU; throughput keeps increasing by a large margin with larger batch sizes (see results in Table 3). Third, in comparison to the 32-thread CPU baseline, our GPU extraction only uses a single thread on the CPU throughout its execution, so the CPU portion of the performance can be further improved, especially in the feature generation passes (see §4.3 for details).
Of course, the GPU/CPU combination requires a server equipped with a GPU, incurring additional hardware costs. We estimate that in Q4 2014 dollars, our base system would cost roughly $7500 USD, and the GPU would cost another $2600 USD. However, the server-grade GPU used in this work is not the only choice: a typical high-end consumer GPU, such as the NVIDIA GTX Titan Black (around $1100), costs considerably less but has even higher memory bandwidth and similarly impressive floating point performance. This price difference is due to extra functionality (e.g., error-correcting code memory) for specific applications (e.g., scientific computing), and is not directly related to differences in raw computational power. This means that we could speed up overall translation by 68% if we spend an additional 35% (server-grade GPU) or 15% (consumer-grade GPU) on hardware. From an economic perspective, this is an attractive proposition. Of course, the advantages of using GPUs for high-throughput translation go up further with larger batch sizes.

4.5 One-time Construction Costs

Construction of static data structures for on-demand grammar extraction is a one-time cost given a corpus T. However, under a streaming scenario where we might receive incremental changes to T as new training data become available, we need to update the data structures appropriately. Updating static data structures involves two costs: the suffix array with its LCP array, and the precomputation indexes. We do not consider the alignment construction cost, as it is external to cdec (and is a necessary step for all implementations). For the Xinhua data, building the suffix array using a single CPU thread takes 29.2 seconds and building the precomputation indexes on the GPU takes 5.7 seconds. Compared to Table 7, these one-time costs represent approximately 27% of the GPU grammar extraction time.

It is possible to lower the construction costs of these data structures given recent advances. Levenberg et al. (2010) describe novel algorithms that allow efficient in-place updates of the suffix array when new training data arrive. That work directly tackles on-demand SMT architectures in the streaming data scenario. Alternatively, the speed of suffix array construction can be improved significantly by the CUDA Data Parallel Primitives Library,4 which provides fast sorting algorithms to efficiently construct suffix arrays on the GPU. Minimizing data preparation costs has not been a focus of this work, but we believe that the massive parallelism provided by the GPU represents promising future work.

4 http://cudpp.github.io/cudpp/2.2/

5 Conclusions and Future Work

The increasing demands for translation services because of globalization (Pangeanic, 2013; Sykes, 2009) make high-throughput translation a realistic scenario, and one that our GPU implementation is highly suited to serve. High-throughput translation also enables downstream applications such as document translation in cross-language information retrieval (Oard and Hackett, 1997), where we translate the entire source document collection into the target language prior to indexing.

The number of transistors on a chip continues to increase exponentially, a trend that even pessimists concede should continue at least until the end of the decade (Vardi, 2014). Computer architects widely agree that instruction-level hardware parallelism is long past the point of diminishing returns (Olukotun and Hammond, 2005).
This has led to a trend of placing greater numbers of cores on the same die. The question is how to best utilize the transistor budget: a small number of complex cores, a large number of simple cores, or a mixture of both? For our problem, it appears that we can take advantage of brute force scans and fine-grained parallelism inherent in the problem of on-demand extraction, which makes investments in large numbers of simple cores (as on the GPU) a win.

This observation is in line with trends in other areas of computing. Many problems in computational biology, like computational linguistics, boil down to efficient search on discrete sequences of symbols. The DNA sequence alignment systems MummerGPU (Schatz et al., 2007) and MummerGPU 2 (Trapnell and Schatz, 2009) use suffix trees to perform DNA sequence matching on the GPU, while the state-of-the-art system MummerGPU++ (Gharaibeh and Ripeanu, 2010) uses suffix arrays, as we do here. Our algorithms for matching gappy patterns in passes 3 and 4 are closely related to seed-and-extend algorithms for approximate matching in DNA sequences, which have recently been implemented on GPUs (Wilton et al., 2014).

It is unlikely that CPU processing will become obsolete, since not all problems can be cast in a data-parallel framework. Ultimately, we need a hybrid architecture where parallelizable tasks are offloaded to the GPU, which works in conjunction with the CPU to handle irregular computations in a pipelined fashion. In a well-balanced system, both the GPU and the CPU would be fully utilized, performing the types of computation they excel at, unlike in our current design, where the GPU sits idle while the CPU finishes decoding.

A part of our broad research agenda is exploring which aspects of the machine translation pipeline are amenable to GPU algorithms. The performance analysis in §4.3 shows that even in grammar extraction there are CPU bottlenecks we need to address and opportunities for further optimization. Beyond grammar extraction, there is a question about whether decoding can be moved to the GPU. Memory is a big hurdle: since accessing data structures off-GPU is costly, it would be preferable to hold all models in GPU memory. We've addressed the problem for translation models, but the language models used in machine translation are also large. It might be possible to use lossy compression (Talbot and Osborne, 2007) or batch request strategies (Brants et al., 2007) to solve this problem. If we do, we believe that translation models could be decoded using variants of GPU algorithms for speech (Chong et al., 2009) or parsing (Yi et al., 2011; Canny et al., 2013; Hall et al., 2014), though the latter algorithms exploit properties of latent-variable grammars that may not extend to translation. Thinking beyond decoding, we believe that other problems in computational linguistics might benefit from the massive parallelism offered by GPUs.

Acknowledgments

This research was supported in part by the BOLT program of the Defense Advanced Research Projects Agency, Contract No. HR0011-12-C-0015; NSF under award IIS-1218043; and the Human Language Technology Center of Excellence at Johns Hopkins University. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors and do not necessarily reflect views of the sponsors. The second author is grateful to Esther and Kiri for their loving support and dedicates this work to Joshua and Jacob.
We thank Nikolay Bogoychev and the anonymous TACL reviewers for helpful comments on previous drafts, Rich Cox for support and advice, and CLIP labmates (particularly Junhui Li and Wu Ke) for helpful discussions. We also thank UMIACS for providing hardware resources via the NVIDIA CUDA Center of Excellence, and the UMIACS IT staff, especially Joe Webster, for excellent support.

References

Paul Baltescu and Phil Blunsom. 2014. A fast and simple online synchronous context free grammar extractor. The Prague Bulletin of Mathematical Linguistics, 102(1):17–26.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 858–867.

Chris Callison-Burch, Colin Bannard, and Josh Schroeder. 2005. Scaling phrase-based statistical machine translation to larger corpora and longer phrases. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pages 255–262.

John Canny, David Hall, and Dan Klein. 2013. A multi-teraflop constituency parser using GPUs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pages 1898–1907.

Victor Chahuneau, Noah A. Smith, and Chris Dyer. 2012. pycdec: A Python interface to cdec. In Proceedings of the 7th Machine Translation Marathon (MTM 2012).

M. Chandwani, M. Puranik, and N. S. Chaudhari. 1992. On CKY-parsing of context-free grammars in parallel. In Proceedings of Technology Enabling Tomorrow: Computers, Communications and Automation towards the 21st Century, pages 141–145.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

David Chiang. 2012. Hope and fear for discriminative training of statistical translation models. Journal of Machine Learning Research, 13:1159–1187.

Jike Chong, Ekaterina Gonina, Youngmin Yi, and Kurt Keutzer. 2009. A fully data parallel WFST-based large vocabulary continuous speech recognition on a graphics processing unit. In Proceedings of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH 2009), pages 1183–1186.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011), pages 176–181.

Chris Dyer, Adam Lopez, Juri Ganitkevitch, Johnathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proceedings of the ACL 2010 System Demonstrations, pages 7–12.

Michel Galley and Christopher D. Manning. 2010. Accurate non-hierarchical phrase-based translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2010), pages 966–974.

Abdullah Gharaibeh and Matei Ripeanu. 2010. Size matters: Space/time tradeoffs to improve GPGPU applications performance. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2010), pages 1–12.

Jesús González-Rubio, Daniel Ortiz-Martínez, and Francisco Casacuberta. 2012.
Active learning for interactive machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), pages 245–254.

David Hall, Taylor Berg-Kirkpatrick, and Dan Klein. 2014. Sparser, better, faster GPU parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), pages 208–217.

Hua He, Jimmy Lin, and Adam Lopez. 2013. Massively parallel suffix array queries and on-demand phrase extraction for statistical machine translation using GPUs. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 325–334.

Jared Hoberock and Nathan Bell. 2010. Thrust: A parallel template library. Version 1.8.0.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180.

Abby Levenberg, Chris Callison-Burch, and Miles Osborne. 2010. Stream-based translation models for statistical machine translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2010), pages 394–402.

Adam Lopez. 2007. Hierarchical phrase-based translation with suffix arrays. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 976–985.

Adam Lopez. 2008a. Machine translation by pattern matching. Ph.D. dissertation, University of Maryland, College Park, Maryland, USA.

Adam Lopez. 2008b. Tera-scale translation models via pattern matching. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), pages 505–512.

Udi Manber and Gene Myers. 1990. Suffix arrays: a new method for on-line string searches. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '90), pages 319–327.

Douglas W. Oard and Paul Hackett. 1997. Document translation for cross-language text retrieval at the University of Maryland. In Proceedings of the 7th Text REtrieval Conference (TREC-7).

Franz Josef Och, Christoph Tillmann, and Hermann Ney. 1999. Improved alignment models for statistical machine translation. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP 1999), pages 20–28.

Kunle Olukotun and Lance Hammond. 2005. The future of microprocessors. ACM Queue, 3(7):27–34.

Pangeanic. 2013. What is The Size of the Translation Industry? http://www.pangeanic.com/knowledge_center/size-translation-industry/.

Aaron B. Philips. 2012. Modeling Relevance in Statistical Machine Translation: Scoring Alignment, Context, and Annotations of Translation Instances. Ph.D. thesis, Carnegie Mellon University.

Michael Schatz, Cole Trapnell, Arthur Delcher, and Amitabh Varshney. 2007. High-throughput sequence alignment using graphics processing units. BMC Bioinformatics, 8(1):474.

Michel Simard, Nicola Cancedda, Bruno Cavestro, Marc Dymetman, Eric Gaussier, Cyril Goutte, and Kenji Yamada. 2005. Translating with non-contiguous phrases.
In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (EMNLP 2005), pages 755–762.

Patrick Simianer, Stefan Riezler, and Chris Dyer. 2012. Joint feature selection in distributed stochastic learning for large-scale discriminative training in SMT. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), pages 11–21.

Tanisha Sykes. 2009. Growth in Translation. http://www.inc.com/articles/2009/08/translation.html.

David Talbot and Miles Osborne. 2007. Randomised language modelling for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), pages 512–519.

Cole Trapnell and Michael C. Schatz. 2009. Optimizing data intensive GPGPU computations for DNA sequence alignment. Parallel Computing, 35(8-9):429–440.

Moshe Y. Vardi. 2014. Moore's Law and the sand-heap paradox. Communications of the ACM, 57(5):5.

Richard Wilton, Tamas Budavari, Ben Langmead, Sarah Wheelan, Steven L. Salzberg, and Alex Szalay. 2014. Faster sequence alignment through GPU-accelerated restriction of the seed-and-extend search space. http://dx.doi.org/10.1101/007641.

Youngmin Yi, Chao-Yue Lai, Slav Petrov, and Kurt Keutzer. 2011. Efficient parallel CKY parsing on GPUs. In Proceedings of the 12th International Conference on Parsing Technologies, pages 175–185.

Ying Zhang and Stephan Vogel. 2005. An efficient phrase-to-phrase alignment model for arbitrarily long phrase and large corpora. In Proceedings of the Tenth Conference of the European Association for Machine Translation (EAMT-05).