Decoding Anagrammed Texts Written in an Unknown Language and Script

Bradley Hauer and Grzegorz Kondrak
Department of Computing Science, University of Alberta, Edmonton, Canada
{bmhauer,gkondrak}@ualberta.ca

Abstract

Algorithmic decipherment is a prime example of a truly unsupervised problem. The first step in the decipherment process is the identification of the encrypted language. We propose three methods for determining the source language of a document enciphered with a monoalphabetic substitution cipher. The best method achieves 97% accuracy on 380 languages. We then present an approach to decoding anagrammed substitution ciphers, in which the letters within words have been arbitrarily transposed. It obtains the average decryption word accuracy of 93% on a set of 50 ciphertexts in 5 languages. Finally, we report the results on the Voynich manuscript, an unsolved fifteenth century cipher, which suggest Hebrew as the language of the document.

1 Introduction

The Voynich manuscript is a medieval codex1 consisting of 240 pages written in a unique script, which has been referred to as the world’s most important unsolved cipher (Schmeh, 2013). The type of cipher that was used to generate the text is unknown; a number of theories have been proposed, including substitution and transposition ciphers, an abjad (a writing system in which vowels are not written), steganography, semi-random schemes, and an elaborate hoax. However, the biggest obstacle to deciphering the manuscript is the lack of knowledge of what language it represents.

1 The manuscript was radiocarbon dated to 1404-1438 AD in the Arizona Accelerator Mass Spectrometry Laboratory (http://www.arizona.edu/crack-voynich-code, accessed Nov. 20, 2015).

Identification of the underlying language has been crucial for the decipherment of ancient scripts, including Egyptian hieroglyphics (Coptic), Linear B (Greek), and Mayan glyphs (Ch’olti’).
On the other hand, the languages of many undeciphered scripts, such as Linear A, the Indus script, and the Phaistos Disc, remain unknown (Robinson, 2002). Even the order of characters within text may be in doubt; in Egyptian hieroglyphic inscriptions, for instance, the symbols were sometimes rearranged within a word in order to create a more elegant inscription (Singh, 2011). Another complicating factor is the omission of vowels in some writing systems.

Applications of ciphertext language identification extend beyond secret ciphers and ancient scripts. Nagy et al. (1987) frame optical character recognition as a decipherment task. Knight et al. (2006) note that for some languages, such as Hindi, there exist many different and incompatible encoding schemes for digital storage of text; the task of analyzing such an arbitrary encoding scheme can be viewed as a decipherment of a substitution cipher in an unknown language. Similarly, the unsupervised derivation of transliteration mappings between different writing scripts lends itself to a cipher formulation (Ravi and Knight, 2009).

The Voynich manuscript is written in an unknown script that encodes an unknown language, which is the most challenging type of a decipherment problem (Robinson, 2002, p. 46). Inspired by the mystery of both the Voynich manuscript and the undeciphered ancient scripts, we develop a series of algorithms for the purpose of decrypting unknown alphabetic scripts representing unknown languages. We assume that symbols in scripts which contain no more than a few dozen unique characters roughly correspond to phonemes of a language, and model them as monoalphabetic substitution ciphers.

Transactions of the Association for Computational Linguistics, vol. 4, pp. 75–86, 2016. Action Editor: Regina Barzilay. Submission batch: 12/2015; Published 4/2016. ©2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
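As a concrete reference point, a monoalphabetic substitution cipher can be sketched in a few lines of Python; the key generation and the choice of alphabet here are illustrative assumptions, not part of the paper's method:

```python
import random
import string

def random_key(alphabet=string.ascii_lowercase):
    """A random 1-to-1 monoalphabetic substitution key."""
    shuffled = random.sample(alphabet, len(alphabet))
    return dict(zip(alphabet, shuffled))

def encipher(text, key):
    """Substitute each letter; other symbols (spaces etc.) pass through,
    so word boundaries and within-word repetition patterns survive."""
    return "".join(key.get(c, c) for c in text)
```

Because the mapping is 1-to-1, relative symbol frequencies and patterns of repeated symbols are invariant under encipherment, which is precisely what the identification methods of Section 3 exploit.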
We further allow that an unknown transposition scheme could have been applied to the enciphered text, resulting in arbitrary scrambling of letters within words (anagramming). Finally, we consider the possibility that the underlying script is an abjad, in which only consonants are explicitly represented.

Our decryption system is composed of three steps. The first task is to identify the language of a ciphertext, by comparing it to samples representing known languages. The second task is to map each symbol of the ciphertext to the corresponding letter in the identified language. The third task is to decode the resulting anagrams into readable text, which may involve the recovery of unwritten vowels.

The paper is structured as follows. We discuss related work in Section 2. In Section 3, we propose three methods for the source language identification of texts enciphered with a monoalphabetic substitution cipher. In Section 4, we present and evaluate our approach to the decryption of texts composed of enciphered anagrams. In Section 5, we apply our new techniques to the Voynich manuscript. Section 6 concludes the paper.

2 Related Work

In this section, we review particularly relevant prior work on the Voynich manuscript, and on algorithmic decipherment in general.

2.1 Voynich Manuscript

Since the discovery of the Voynich manuscript (henceforth referred to as the VMS), there have been a number of decipherment claims. Newbold and Kent (1928) proposed an interpretation based on microscopic details in the text, which was subsequently refuted by Manly (1931). Other claimed decipherments by Feely (1943) and Strong (1945) have also been refuted (Tiltman, 1968). A detailed study of the manuscript by d’Imperio (1978) documents various other proposed solutions and the arguments against them.

Figure 1: A sample from the Voynich manuscript.

Numerous languages have been proposed to underlie the VMS.
The properties and the dating of the manuscript imply Latin and Italian as potential candidates. On the basis of the analysis of the character frequency distribution, Jaskiewicz (2011) identifies five most probable languages, which include Moldavian and Thai. Reddy and Knight (2011) discover an excellent match between the VMS and Quranic Arabic in the distribution of word lengths, as well as a similarity to Chinese Pinyin in the predictability of letters given the preceding letter.

It has been suggested previously that some anagramming scheme may alter the sequence order of characters within words in the VMS. Tiltman (1968) observes that each symbol behaves as if it had its own place in an “order of precedence” within words. Rugg (2004) notes the apparent similarity of the VMS to a text in which each word has been replaced by an alphabetically ordered anagram (alphagram). Reddy and Knight (2011) show that the letter sequences are generally more predictable than in natural languages.

Some researchers have argued that the VMS may be an elaborate hoax created to only appear as a meaningful text. Rugg (2004) suggests a tabular method, similar to the sixteenth century technique of the Cardan grille, although recent dating of the manuscript to the fifteenth century provides evidence to the contrary. Schinner (2007) uses analysis of random walk techniques and textual statistics to support the hoax hypothesis. On the other hand, Landini (2001) identifies in the VMS language-like statistical properties, such as Zipf’s law, which were only discovered in the last century. Similarly, Montemurro and Zanette (2013) use information theoretic techniques to find long-range relationships between words and sections of the manuscript, as well as between the text and the figures in the VMS.
2.2 Algorithmic Decipherment

A monoalphabetic substitution cipher is a well-known method of enciphering a plaintext by converting it into a ciphertext of the same length using a 1-to-1 mapping of symbols. Knight et al. (2006) propose a method for deciphering substitution ciphers which is based on Viterbi decoding with mapping probabilities computed with the expectation-maximization (EM) algorithm. The method correctly deciphers 90% of symbols in a 400-letter ciphertext when a trigram character language model is used. They apply their method to ciphertext language identification using 80 different language samples, and report successful outcomes on three ciphers that represent English, Spanish, and a Spanish abjad, respectively.

Ravi and Knight (2008) present a more complex but slower method for solving substitution ciphers, which incorporates constraints that model the 1-to-1 property of the key. The objective function is again the probability of the decipherment relative to an n-gram character language model. A solution is found by optimally solving an integer linear program.

Knight et al. (2011) describe a successful decipherment of an eighteenth century text known as the Copiale Cipher. Language identification was the first step of the process. The EM-based method of Knight et al. (2006) identified German as the most likely candidate among over 40 candidate character language models. The more accurate method of Ravi and Knight (2008) was presumably either too slow or too brittle for this purpose. The cipher was eventually broken using a combination of manual and algorithmic techniques.

Hauer et al. (2014) present an approach to solving monoalphabetic substitution ciphers which is more accurate than other algorithms proposed for this task, including Knight et al. (2006), Ravi and Knight (2008), and Norvig (2009). We provide a detailed description of the method in Section 4.1.
3 Source Language Identification

In this section, we propose and evaluate three methods for determining the source language of a document enciphered with a monoalphabetic substitution cipher. We frame it as a classification task, with the classes corresponding to the candidate languages, which are represented by short sample texts. The methods are based on:

1. relative character frequencies,
2. patterns of repeated symbols within words,
3. the outcome of a trial decipherment.

3.1 Character Frequency

An intuitive way of guessing the source language of a ciphertext is by character frequency analysis. The key observation is that the relative frequencies of symbols in the text are unchanged after encipherment with a 1-to-1 substitution cipher. The idea is to order the ciphertext symbols by frequency, normalize these frequencies to create a probability distribution, and choose the closest matching distribution from the set of candidate languages.

More formally, let P_T be a discrete probability distribution where P_T(i) is the probability of a randomly selected symbol in a text T being the i-th most frequent symbol. We define the distance between two texts U and V to be the Bhattacharyya (1943) distance between the probability distributions P_U and P_V:

d(U, V) = − ln Σ_i √(P_U(i) · P_V(i))

The advantages of this distance metric include its symmetry, and the ability to account for events that have a zero probability (in this case, due to different alphabet sizes). The language of the closest sample text to the ciphertext is considered to be the most likely source language. This method is not only fast but also robust against letter reordering and the lack of word boundaries.

3.2 Decomposition Pattern Frequency

Our second method expands on the character frequency method by incorporating the notion of decomposition patterns. This method uses multiple occurrences of individual symbols within a word as a clue to the language of the ciphertext.
For example, the word seems contains two instances of ‘s’ and ‘e’, and one instance of ‘m’. We are interested in capturing the relative frequency of such patterns in texts, independent of the symbols used.

Formally, we define a function f that maps a word to an ordered n-tuple (t1, t2, . . . , tn), where ti ≥ tj if i < j. Each ti is the number of occurrences of the i-th most frequent character in the word. For example, f(seems) = (2, 2, 1), while f(beams) = (1, 1, 1, 1, 1). We refer to the resulting tuple as the decomposition pattern of the word. The decomposition pattern is unaffected by monoalphabetic letter substitution or anagramming. As with the character frequency method, we define the distance between two texts as the Bhattacharyya distance between their decomposition pattern distributions, and classify the language of a ciphertext as the language of the nearest sample text.

It is worth noting that this method requires word separators to be preserved in the ciphertext. In fact, the effectiveness of the method comes partly from capturing the distribution of word lengths in a text. On the other hand, the decomposition patterns are independent of the ordering of characters within words. We will take advantage of this property in Section 4.

3.3 Trial Decipherment

The final method that we present involves deciphering the document in question into each candidate language. The decipherment is performed with a fast greedy-swap algorithm, which is related to the algorithms of Ravi and Knight (2008) and Norvig (2009). It attempts to find the key that maximizes the probability of the decipherment according to a bigram character language model derived from a sample document in a given language. The decipherment with the highest probability indicates the most likely plaintext language of the document. The greedy-swap algorithm is shown in Figure 2.
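The greedy-swap loop can be rendered as a minimal Python sketch. The character-language-model scorer `log_prob` is an assumed stand-in, the random restarts are omitted, and for brevity we try all symbol pairs rather than only the "best swaps" that the paper restricts the search to:

```python
from collections import Counter

def decipher(ciphertext, key):
    """Apply a substitution key to a ciphertext string."""
    return "".join(key.get(c, c) for c in ciphertext)

def greedy_swap(ciphertext, cipher_syms, plain_syms, log_prob):
    # Initial key: pair symbols in order of decreasing ciphertext frequency
    # (we assume equal alphabet sizes; the paper pads with null symbols).
    freq = [s for s, _ in Counter(ciphertext).most_common() if s in cipher_syms]
    freq += [s for s in cipher_syms if s not in freq]
    key = dict(zip(freq, plain_syms))
    best_key, best_score = key, log_prob(decipher(ciphertext, key))
    m = 5 * len(cipher_syms)  # iteration bound, as in the paper
    improved = True
    while improved and m > 0:
        improved, m = False, m - 1
        k = dict(best_key)
        # Simplification: the paper only considers swaps that touch the 10
        # least common bigrams of the current decipherment.
        for i, c1 in enumerate(cipher_syms):
            for c2 in cipher_syms[i + 1:]:
                k2 = dict(k)
                k2[c1], k2[c2] = k2[c2], k2[c1]
                score = log_prob(decipher(ciphertext, k2))
                if score > best_score:
                    best_key, best_score, improved = k2, score, True
    return best_key
```

The swap of two ciphertext symbols corresponds to the transposition k(c1↔c2) in Figure 2; any scoring function over candidate plaintexts can be plugged in as `log_prob`.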
The initial key is created by pairing the ciphertext and plaintext symbols in the order of decreasing frequency, with null symbols appended to the shorter of the two alphabets. The algorithm repeatedly attempts to improve the current key k by considering the “best” swaps of ciphertext symbol pairs within the key (if the key is viewed as a permutation of the alphabet, such a swap is a transposition). The best swaps are defined as those that involve a symbol occurring among the 10 least common bigrams in the decipherment induced by the current key. If any such swap yields a more probable decipherment, it is incorporated in the current key; otherwise, the algorithm terminates. The total number of iterations is bounded by m, which is set to 5 times the size of the alphabet. After the initial run, the algorithm is restarted 20 times with a randomly generated initial key, which often results in a better decipherment. All parameters were established on a development set.

1: kmax ← InitialKey
2: for m iterations do
3:   k ← kmax
4:   S ← best swaps for k
5:   for each {c1, c2} ∈ S do
6:     k′ ← k(c1↔c2)
7:     if p(k′) > p(kmax) then kmax ← k′
8:   if kmax = k then return kmax
9: return kmax

Figure 2: Greedy-swap decipherment algorithm.

3.4 Evaluation

We now directly evaluate the three methods described above by applying them to a set of ciphertexts from different languages. We adapted the dataset created by Emerson et al. (2014) from the text of the Universal Declaration of Human Rights (UDHR) in 380 languages.2 The average length of the texts is 1710 words and 11073 characters. We divided the text in each language into 66% training, 17% development, and 17% test. The training part was used to derive character bigram models for each language. The development and test parts were separately enciphered with a random substitution cipher.

Table 1 shows the results of the language identification methods on both the development and the test set.
We report the average top-1 accuracy on the task of identifying the source language of 380 enciphered test samples. The differences between methods are statistically significant according to McNemar’s test with p < 0.0001. The random baseline of 0.3% indicates the difficulty of the task. The “oracle” decipherment assumes a perfect decipherment of the text, which effectively reduces the task to standard language identification.

2 Eight languages from the original set were excluded because of formatting issues.

Method                  Dev   Test
Random Selection         0.3   0.3
Jaskiewicz (2011)       54.2  47.6
Character Frequency     72.4  67.9
Decomposition Pattern   90.5  85.5
Trial Decipherment      94.2  97.1
Oracle Decipherment     98.2  98.4

Table 1: Language identification accuracy (in % correct) on ciphers representing 380 languages.

All three of our methods perform well, with the accuracy gains reflecting their increasing complexity. Between the two character frequency methods, our approach based on Bhattacharyya distance is significantly more accurate than the method of Jaskiewicz (2011), which uses a specially-designed distribution distance function. The decomposition pattern method makes many fewer errors, with the correct language ranked second in roughly half of those cases. Trial decipherment yields the best results, which are close to the upper bound for the character bigram probability approach to language identification. The average decipherment error rate into the correct language is only 2.5%. In 4 out of 11 identification errors made on the test set, the error rate is above the average; the other 7 errors involve closely related languages, such as Serbian and Bosnian.

The trial decipherment approach is much slower than the frequency distribution methods, requiring roughly one hour of CPU time in order to classify each ciphertext. More complex decipherment algorithms are even slower, which precludes their application to this test set.
Our re-implementations of the dynamic programming algorithm of Knight et al. (2006), and the integer programming solver of Ravi and Knight (2008) average 53 and 7000 seconds of CPU time, respectively, to solve a single 256-character cipher, compared to 2.6 seconds with our greedy-swap method. The dynamic programming algorithm improves decipherment accuracy over our method by only 4% on a benchmark set of 50 ciphers of 256 characters. We conclude that our greedy-swap algorithm strikes the right balance between accuracy and speed required for the task of cipher language identification.

4 Anagram Decryption

In this section, we address the challenging task of deciphering a text in an unknown language written using an unknown script, and in which the letters within words have been randomly scrambled. The task is designed to emulate the decipherment problem posed by the VMS, with the assumption that its unusual ordering of characters within words reflects some kind of a transposition cipher. We restrict the source language to be one of the candidate languages for which we have sample texts; we model an unknown script with a substitution cipher; and we impose no constraints on the letter transposition method. The encipherment process is illustrated in Figure 3. The goal in this instance is to recover the plaintext in (a) given the ciphertext in (c) without the knowledge of the plaintext language. We also consider an additional encipherment step that removes all vowels from the plaintext.

Our solution is composed of a sequence of three modules that address the following tasks: language identification, script decipherment, and anagram decoding. For the first task we use the decomposition pattern frequency method described in Section 3.2, which is applicable to anagrammed ciphers.
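The decomposition-pattern identifier of Sections 3.1-3.2 can be sketched compactly; the corpus handling and whitespace tokenization below are illustrative assumptions:

```python
from collections import Counter
from math import log, sqrt

def decomposition_pattern(word):
    """f(word): letter counts in decreasing order, e.g. f('seems') = (2, 2, 1)."""
    return tuple(sorted(Counter(word).values(), reverse=True))

def pattern_distribution(text):
    """Relative frequency of each decomposition pattern in a text."""
    counts = Counter(decomposition_pattern(w) for w in text.split())
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

def bhattacharyya(p, q):
    """d(U, V) = -ln sum_i sqrt(P_U(i) * P_V(i)); events with zero
    probability in either distribution contribute nothing to the sum."""
    bc = sum(sqrt(p[k] * q[k]) for k in p.keys() & q.keys())
    return float("inf") if bc == 0 else -log(bc)

def identify(ciphertext, samples):
    """Return the language whose sample text is nearest to the ciphertext."""
    d = pattern_distribution(ciphertext)
    return min(samples, key=lambda lang: bhattacharyya(d, pattern_distribution(samples[lang])))
```

Because the pattern is invariant under both substitution and anagramming, the same code applies unchanged to anagrammed ciphertexts.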
After identifying the plaintext language, we proceed to reverse the substitution cipher using a heuristic search algorithm guided by a combination of word and character language models. Finally, we unscramble the anagrammed words into readable text by framing the decoding as a tagging task, which is efficiently solved with a Viterbi decoder. Our modular approach makes it easy to perform different levels of analysis on unsolved ciphers.

(a) organized compositions through improvisational music into genres
(b) fyovicstu dfnrfecpcfie pbyfzob cnryfgcevpcfivm nzecd cipf otiyte
(c) otvfusyci cpifenfercfd bopbfzy fgyiemcpfcvrcnv nczed fpic etotyi
(d) adegiknor ciimnooopsst ghhortu aaiiilmnooprstv cimsu inot eegnrs
(e) adegiknor compositions through aaiiilmnooprstv music into greens

Figure 3: An example of the encryption and decryption process: (a) plaintext; (b) after applying a substitution cipher; (c) ciphertext after random anagramming; (d) after substitution decipherment (in the alphagram representation); (e) final decipherment after anagram decoding (errors are underlined).

4.1 Script Decipherment

For the decipherment step, we adapt the state-of-the-art solver of Hauer et al. (2014). In this section, we describe the three main components of the solver: key scoring, key mutation, and tree search. This is followed by the summary of modifications that make the method work on anagrams.

The scoring component evaluates the fitness of each key by computing the smoothed probability of the resulting decipherment with both character-level and word-level language models. The word-level models promote decipherments that contain in-vocabulary words and high-probability word n-grams, while the character-level models allow for the incorporation of out-of-vocabulary words.

The key mutation component crucially depends on the notion of pattern equivalence between character strings. Two strings are pattern-equivalent if they share the same pattern of repeated letters.
For example, MZXCX is pattern-equivalent with there and bases, but not with otter. For each word unigram, bigram, and trigram in the ciphertext, a list of the most frequent pattern-equivalent n-grams from the training corpus is compiled. The solver repeatedly attempts to improve the current key through a series of transpositions, so that a given cipher n-gram maps to a pattern-equivalent n-gram from the provided language sample. The number of substitutions for a given n-gram is limited to the k most promising candidates, where k is a parameter optimized on a development set.

The key mutation procedure generates a tree structure, which is searched for the best-scoring decipherment using a version of beam search. The root of the tree contains the initial key, which is generated according to simple frequency analysis (i.e., by mapping the n-th most common ciphertext character to the n-th most common character in the corpus). New tree leaves are spawned by modifying the keys of current leaves, while ensuring that each node in the tree has a unique key. At the end of computation, the key with the highest score is returned as the solution.

In our anagram adaptation, we relax the definition of pattern equivalence to include strings that have the same decomposition pattern, as defined in Section 3.2. Under the new definition, the order of the letters within a word has no effect on pattern equivalence. For example, MZXCX is equivalent not only with there and bases, but also with three and otter, because all these words map to the (2, 1, 1, 1) pattern. Internally, we represent all words as alphagrams, in which letters are reshuffled into the alphabetical order (Figure 3d). In order to handle the increased ambiguity, we use a letter-frequency heuristic to select the most likely mapping of letters within an n-gram.
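The two equivalence notions and the alphagram representation can be sketched as follows; the function names are ours, not the solver's:

```python
from collections import Counter

def repetition_pattern(word):
    """Strict pattern equivalence: map each letter to the index of its
    first occurrence, so strings with the same layout of repeats match."""
    first = {}
    return tuple(first.setdefault(c, len(first)) for c in word)

def decomposition_pattern(word):
    """Relaxed (anagram-invariant) equivalence: letter counts in
    decreasing order, ignoring letter order entirely."""
    return tuple(sorted(Counter(word).values(), reverse=True))

def alphagram(word):
    """Canonical anagram representation: letters in alphabetical order."""
    return "".join(sorted(word))
```

Under the strict notion MZXCX matches there but not otter; under the relaxed notion both match, since all three reduce to the (2, 1, 1, 1) pattern.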
The trigram language models over both words and characters are derived by converting each word in the training corpus into its alphagram. On a benchmark set of 50 ciphers of length 256, the average error rate of the modified solver is 2.6%, with only a small increase in time and space usage.

4.2 Anagram Decoder

The output of the script decipherment step is generally unreadable (see Figure 3d). The words might be composed of the right letters but their order is unlikely to be correct. We proceed to decode the sequence of anagrams by framing it as a simple hidden Markov model, in which the hidden states correspond to plaintext words, and the observed sequence is composed of their anagrams. Without loss of generality, we convert anagrams into alphagrams, so that the emission probabilities are always equal to 1. Any alphagrams that correspond to unseen words are replaced with a single ‘unknown’ type. We then use a modified Viterbi decoder to determine the most likely word sequence according to a word trigram language model, which is derived from the training corpus, and smoothed using deleted interpolation (Jelinek and Mercer, 1980).

4.3 Vowel Recovery

Many writing systems, including Arabic and Hebrew, are abjads that do not explicitly represent vowels. Reddy and Knight (2011) provide evidence that the VMS may encode an abjad. The removal of vowels represents a substantial loss of information, and appears to dramatically increase the difficulty of solving a cipher.

In order to apply our system to abjads, we remove all vowels in the corpora prior to deriving the language models used by the script decipherment step. We assume the ability to partition the plaintext symbols into disjoint sets of vowels and consonants for each candidate language. The anagram decoder is trained to recover complete in-vocabulary words from sequences of anagrams containing only consonants.
At test time, we remove the vowels from the input to the decipherment step of the pipeline. In contrast with Knight et al. (2006), our approach is able not only to attack abjad ciphers, but also to restore the vowels, producing fully readable text.

4.4 Evaluation

In order to test our anagram decryption pipeline on out-of-domain ciphertexts, the corpora for deriving language models need to be much larger than the UDHR samples used in the previous section. We selected five diverse European languages from Europarl (Koehn, 2005): English, Bulgarian, German, Greek, and Spanish. The corresponding corpora contain about 50 million words each, with the exception of Bulgarian, which has only 9 million words. We remove punctuation and numbers, and lowercase all text.

We test on texts extracted from Wikipedia articles on art, Earth, Europe, film, history, language, music, science, technology, and Wikipedia. The texts are first enciphered using a substitution cipher, and then anagrammed (Figure 3a-c). Each of the five languages is represented by 10 ciphertexts, which are decrypted independently. In order to keep the running time reasonable, the length of the ciphertexts is set to 500 characters.

The first step is language identification. Our decomposition pattern method, which is resistant to both anagramming and substitution, correctly identifies the source language of 49 out of 50 ciphertexts. The lone exception is the German article on technology, for which German is the second ranked language after Greek. This error could be easily detected by noticing that most of the Greek words “deciphered” by the subsequent steps are out of vocabulary. We proceed to evaluate the following steps assuming that the source language is known.
            Step 2  Step 3  Both  Ceiling
English       99.5    98.2  97.7     98.7
Bulgarian     97.0    94.7  91.9     95.3
German        97.3    90.6  88.7     91.8
Greek         95.7    96.6  92.7     97.2
Spanish       99.1    98.0  97.1     99.0
Average       97.7    95.7  93.8     96.5

Table 2: Word accuracy on the anagram decryption task.

The results in Table 2 show that our system is able to effectively break the anagrammed ciphers in all five languages. For Step 2 (script decipherment), we count as correct all word tokens that contain the right characters, disregarding their order. Step 3 (anagram decoding) is evaluated under the assumption that it has received a perfect decipherment from Step 2. On average, the accuracy of each individual step exceeds 95%. The values in the column denoted as Both are the actual results of the pipeline composed of Steps 2 and 3. Our system correctly recovers 93.8% of word tokens, which corresponds to over 97% of the in-vocabulary words within the test files. The percentages of in-vocabulary words, which are shown in the Ceiling column, constitute the effective accuracy limits for each language.

The errors fall into three categories, as illustrated in Figure 3e. Step 2 introduces decipherment errors (e.g., deciphering ‘s’ as ‘k’ instead of ‘z’ in “organized”), which typically preclude the word from being recovered in the next step. A decoding error in Step 3 may occur when an alphagram corresponds to multiple words (e.g., “greens” instead of “genres”), although most such ambiguities are resolved correctly. However, the majority of errors are caused by out-of-vocabulary (OOV) words in the plaintext (e.g., “improvisational”). Since the decoder can only produce words found in the training corpus, an OOV word almost always results in an error. The German ciphers stand out as having the largest percentage of OOV words (8.2%), which may be attributed to frequent compounding.

Table 3 shows the results of the analogous experiments on abjads (Section 4.3).
Surprisingly, the removal of vowels from the plaintext actually improves the average decipherment step accuracy to 99%. This is due not only to the reduced number of distinct symbols, but also to the fewer possible anagramming permutations in the shortened words. On the other hand, the loss of vowel information makes the anagram decoding step much harder. However, more than three quarters of in-vocabulary tokens are still correctly recovered, including the original vowels.3 In general, this is sufficient for a human reader to understand the meaning of the document, and deduce the remaining words.

            Step 2  Step 3  Both  Ceiling
English       99.9    84.6  84.5     98.7
Bulgarian     99.1    71.1  70.4     95.0
German        98.5    73.7  72.9     91.8
Greek         97.7    65.4  63.6     97.0
Spanish       99.8    73.7  73.3     99.0
Average       99.0    73.8  73.1     96.4

Table 3: Word accuracy on the abjad anagram decryption task.

5 Voynich Experiments

In this section, we present the results of our experiments on the VMS. We attempt to identify the source language with the methods described in Section 3; we quantify the similarity of the Voynich words to alphagrams; and we apply our anagram decryption algorithm from Section 4 to the text.

5.1 Data

Unless otherwise noted, the VMS text used in our experiments corresponds to 43 pages of the manuscript in the “type B” handwriting (VMS-B), investigated by Reddy and Knight (2011), which we obtained directly from the authors. It contains 17,597 words and 95,465 characters, transcribed into 35 characters of the Currier alphabet (d’Imperio, 1978). For the comparison experiments, we selected five languages shown in Table 4, which have been suggested in the past as the language of the VMS (Kennedy and Churchill, 2006). Considering the age of the manuscript, we attempt to use corpora that correspond to older versions of the languages, including King James Bible, Bibbia di Gerusalemme, and Vulgate.
3 The differences in the Ceiling numbers between Tables 2 and 3 are due to words that are composed entirely of vowels.

Language   Text     Words     Characters
English    Bible    804,875    4,097,508
Italian    Bible    758,854    4,246,663
Latin      Bible    650,232    4,150,533
Hebrew     Tanach   309,934    1,562,591
Arabic     Quran     78,245      411,082

Table 4: Language corpora.

5.2 Source Language

In this section, we present the results of our ciphertext language identification methods from Section 3 on the VMS text.

The closest language according to the letter frequency method is Mazatec, a native American language from southern Mexico. Since the VMS was created before the voyage of Columbus, a New World language is an unlikely candidate. The top ten languages also include Mozarabic (3), Italian (8), and Ladino (10), all of which are plausible guesses. However, the experiments in Section 3.4 demonstrate that the frequency analysis is much less reliable than the other two methods.

The top-ranking languages according to the decomposition pattern method are Hebrew, Malay (in Arabic script), Standard Arabic, and Amharic, in this order. We note that three of these belong to the Semitic family. The similarity of decomposition patterns between Hebrew and the VMS is striking. The Bhattacharyya distance between the respective distributions is 0.020, compared to 0.048 for the second-ranking Malay. The histogram in Figure 4 shows Hebrew as a single outlier in the leftmost bin. In fact, Hebrew is closer to a sample of the VMS of a similar length than to any of the remaining 379 UDHR samples.

The ranking produced by the trial decipherment method is sensitive to parameter changes; however, the two languages that consistently appear near the top of the list are Hebrew and Esperanto. The high rank of Hebrew corroborates the outcome of the decomposition pattern method.
Being a relatively recent creation, Esperanto itself can be excluded as the ciphertext language, but its high score is remarkable in view of the well-known theory that the VMS text represents a constructed language.[4] We hypothesize that the extreme morphological regularity of Esperanto (e.g., all plural nouns contain the bigram ‘oj’) yields an unusual character bigram language model which fits the repetitive nature of the VMS words.

[4] The theory was first presented in the form of an anagram (Friedman and Friedman, 1959).

Figure 4: Histogram of distances between the VMS and samples of 380 other languages, as determined by the decomposition pattern method. The single outlier on the left is Hebrew.

In summary, while there is no complete agreement between the three methods about the most likely underlying source language, there appears to be strong statistical support for Hebrew from the two most accurate methods, one of which is robust against anagramming. In addition, the language is a plausible candidate on historical grounds, being widely used for writing in the Middle Ages. In fact, a number of cipher techniques, including anagramming, can be traced to the Jewish Cabala (Kennedy and Churchill, 2006).

5.3 Alphagrams

In this section, we quantify the peculiarity of the VMS lexicon by modeling the words as alphagrams. We introduce the notion of the alphagram distance, and compute it for the VMS and for natural language samples.

We define a word’s alphagram distance with respect to an ordering of the alphabet as the number of letter pairs that are in the wrong order. For example, with respect to the QWERTY keyboard order, the word rye has an alphagram distance of 2 because it contains two letter pairs that violate the order: (r, e) and (y, e). A word is an alphagram if and only if its alphagram distance is zero. The maximum alphagram distance for a word of length n is equal to the number of its distinct letter pairs.
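The definition above translates directly into code. Here `alphagram_distance` is our illustrative helper name, and the QWERTY order reproduces the rye example:

```python
def alphagram_distance(word, order):
    """Count letter pairs (i < j) in `word` whose letters violate
    the reference alphabet `order` (earlier in `order` = smaller)."""
    rank = {ch: i for i, ch in enumerate(order)}
    return sum(1
               for i in range(len(word))
               for j in range(i + 1, len(word))
               if rank[word[i]] > rank[word[j]])

qwerty = "qwertyuiopasdfghjklzxcvbnm"
print(alphagram_distance("rye", qwerty))  # 2: the pairs (r, e) and (y, e)
print(alphagram_distance("wry", qwerty))  # 0: 'wry' is an alphagram in this order
```

Note that repeated letters contribute nothing: a pair of equal letters can never be "in the wrong order", which is consistent with the maximum distance being the number of distinct letter pairs.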
See also a more recent proposal by Balandin and Averyanov (2014).

In order to quantify how strongly the words in a language resemble alphagrams, we first need to identify the order of the alphabet that minimizes the total alphagram distance of a representative text sample. The decision version of this problem is NP-complete, which can be demonstrated by a reduction from the path variant of the traveling salesman problem. Instead, we find an approximate solution with the following greedy search algorithm. Starting from an initial order in which the letters first occur in the text, we repeatedly consider all possible new positions for a letter within the current order, and choose the one that yields the lowest total alphagram distance of the text. This process is repeated until no better order is found for 10 iterations, with 100 random restarts.

When applied to a random sample of 10,000 word tokens from the VMS, our algorithm yields the order 4BZOVPEFSXQYWC28ARUTIJ3*GHK69MDLN5, which corresponds to an average alphagram distance of 0.996 (i.e., slightly less than one pair of letters per word). The corresponding result on English is jzbqwxcpathofvurimslkengdy, with an average alphagram distance of 2.454. Note that the letters at the beginning of the sequence tend to have low frequency, while the ones at the end occur in popular morphological suffixes, such as -ed and -ly. For example, the beginning of the first article of the UDHR with the letters transposed to follow this order becomes: “All ahumn biseng are born free and qaule in tiingdy and thrisg.”

To estimate how close the solution produced by our greedy algorithm is to the actual optimal solution, we also calculate a lower bound for the total alphagram distance under any character order. The lower bound is Σ_{x,y} min(b_xy, b_yx), where b_xy is the number of times character x occurs before character y within words in the text.
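A condensed sketch of the greedy order search and the lower bound Σ_{x,y} min(b_xy, b_yx). The restart schedule (10-iteration patience, 100 random restarts) is omitted, so this single-pass hill climber may settle in a worse local optimum than the search described above; all function names are our own:

```python
from collections import Counter
from itertools import combinations

def pair_counts(words):
    """b[(x, y)]: how often letter x occurs before letter y
    within a word, summed over all words in the sample."""
    b = Counter()
    for w in words:
        for i, j in combinations(range(len(w)), 2):
            if w[i] != w[j]:
                b[(w[i], w[j])] += 1
    return b

def total_distance(order, b):
    """Total alphagram distance of the sample under `order`."""
    rank = {ch: i for i, ch in enumerate(order)}
    return sum(n for (x, y), n in b.items() if rank[x] > rank[y])

def greedy_order(words):
    """Hill climbing: start from first-occurrence order, then keep
    moving single letters to whichever position lowers the total
    alphagram distance, until no move helps."""
    b = pair_counts(words)
    order = []
    for w in words:
        for ch in w:
            if ch not in order:
                order.append(ch)
    improved = True
    while improved:
        improved = False
        for ch in list(order):
            best = total_distance(order, b)
            rest = [c for c in order if c != ch]
            for pos in range(len(order)):
                cand = rest[:pos] + [ch] + rest[pos:]
                if total_distance(cand, b) < best:
                    order, best, improved = cand, total_distance(cand, b), True
    return "".join(order)

def lower_bound(words):
    """Sum over letter pairs {x, y} of min(b_xy, b_yx): no order can
    do better than keeping the majority direction of every pair."""
    b = pair_counts(words)
    return sum(min(n, b.get((y, x), 0))
               for (x, y), n in b.items() if x < y)

words = "all human beings are born free and equal in dignity".split()
order = greedy_order(words)
print(order, total_distance(order, pair_counts(words)), lower_bound(words))
```

Each accepted move strictly decreases an integer objective bounded below by the lower bound, so the loop always terminates; the gap between the two final numbers printed above is the slack the restarts are meant to close.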
Figure 5 shows the average alphagram distances for the VMS and five comparison languages, each represented by a random sample of 10,000 word tokens which excludes single-letter words. The Expected values correspond to a completely random intra-word letter order. The Lexicographic values correspond to the standard alphabetic order in each language. The actual minimum alphagram distance lies between the Lower Bound and the Computed Minimum obtained by our greedy algorithm.

Figure 5: Average word alphagram distances.

The results in Figure 5 show that while the expected alphagram distance for the VMS falls within the range exhibited by natural languages, its minimum alphagram distance is exceptionally low. In absolute terms, the VMS minimum is less than half the corresponding number for Hebrew. In relative terms, the ratio of the expected distance to the minimum distance is below 2 for each of the five languages, but above 4 for the VMS. These results suggest that, if the VMS encodes a natural language text, the letters within the words may have been reordered during the encryption process.

5.4 Decipherment Experiments

In this section, we discuss the results of applying our anagram decryption system described in Section 4 to the VMS text.

We decipher each of the first 10 pages of the VMS-B using the five language models derived from the corpora described in Section 5.1. The pages contain between 292 and 556 words, 3,726 in total. Figure 6 shows the average percentage of in-vocabulary words in the 10 decipherments. The percentage is significantly higher for Hebrew than for the other languages, which suggests a better match with the VMS. Although the abjad versions of English, Italian, and Latin yield similar levels of in-vocabulary words, their distances to the VMS language according to the decomposition pattern method are 0.159, 0.176, and 0.245, respectively, well above Hebrew’s 0.020.
None of the decipherments appear to be syntactically correct or semantically consistent. This is expected because our system is designed for pure monoalphabetic substitution ciphers. If the VMS indeed represents one of the five languages, the amount of noise inherent in the orthography and the transcription would prevent the system from producing a correct decipherment. For example, in a hypothetical non-standard orthography of Hebrew, some prepositions or determiners could be written as separate one-letter words, or a single phoneme could have two different representations. In addition, because of the age of the manuscript and the variety of its handwriting styles, any transcription requires a great deal of guesswork regarding the separation of individual words into distinct symbols (Figure 1). Finally, the decipherments necessarily reflect the corpora that underlie the language models, which may correspond to a different domain and historical period.

Nevertheless, it is interesting to take a closer look at specific examples of the system output. The first line of the VMS (VAS92 9FAE AR APAM ZOE ZOR9 QOR92 9 FOR ZOE89) is deciphered into Hebrew as אנשיו עלי ו לביחו אליו איש הכהן לה ועשה המצות.[5] According to a native speaker of the language, this is not quite a coherent sentence. However, after making a couple of spelling corrections, Google Translate is able to convert it into passable English: “She made recommendations to the priest, man of the house and me and people.”[6]

Even though the input ciphertext is certainly too noisy to result in a fluent output, the system might still manage to correctly decrypt individual words in a longer passage. In order to limit the influence of context in the decipherment, we restrict the word language model to unigrams, and apply our system to the first 72 words (241 characters)[7] from the “Herbal” section of the VMS, which contains drawings of plants.
An inspection of the output reveals several words that would not be out of place in a medieval herbal, such as הצר ‘narrow’, איכר ‘farmer’, אור ‘light’, אויר ‘air’, אש ‘fire’.

Figure 6: Average percentage of in-vocabulary words in the decipherments of the first ten pages of the VMS.

The results presented in this section could be interpreted either as tantalizing clues for Hebrew as the source language of the VMS, or simply as artifacts of the combinatorial power of anagramming and language models. We note that VMS decipherment claims in the past have typically been limited to short passages, without ever producing a full solution. In any case, the output of an algorithmic decipherment of a noisy input can only be a starting point for scholars that are well-versed in the given language and historical period.

[5] Hebrew is written from right to left.
[6] https://translate.google.com/ (accessed Nov. 20, 2015).
[7] The length of the passage was chosen to match the number of symbols in the Phaistos Disc inscription.

6 Conclusion

We have presented a multi-stage system for solving ciphers that combine monoalphabetic letter substitution and unconstrained intra-word letter transposition to encode messages in an unknown language.[8] We have evaluated three methods of ciphertext language identification that are based on letter frequency, decomposition patterns, and trial decipherment, respectively. We have demonstrated that our language-independent approach can effectively break anagrammed substitution ciphers, even when vowels are removed from the input. The application of our methods to the Voynich manuscript suggests that it may represent Hebrew, or another abjad script, with the letters rearranged to follow a fixed order.

There are several possible directions for future work. The pipeline approach presented in this paper might be outperformed by a unified generative model.
The techniques could be made more resistant to noise, for example by softening the emission model in the anagram decoding phase. It would also be interesting to jointly identify both the language and the type of the cipher (Nuhn and Knight, 2014), which could lead to the development of methods to handle more complex ciphers. Finally, the anagram decoding task could be extended to account for the transposition of words within lines, in addition to the transposition of symbols within words.

[8] Software at https://www.cs.ualberta.ca/~kondrak/.

Acknowledgements

We thank Prof. Moshe Koppel for the assessment of the Hebrew examples. We thank the reviewers for their comments and suggestions. This research was supported by the Natural Sciences and Engineering Research Council of Canada, and by Alberta Innovates – Technology Futures and Alberta Innovation & Advanced Education.

References

Arcady Balandin and Sergey Averyanov. 2014. The Voynich manuscript: New approaches to deciphering via a constructed logical language.

A. Bhattacharyya. 1943. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc., 35:99–109.

Mary E. d’Imperio. 1978. The Voynich manuscript: An elegant enigma. Technical report, DTIC Document.

Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer, and Michaela Regneri. 2014. Seedling: Building and using a seed corpus for the human language project. In Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 77–85.

Joseph Martin Feely. 1943. Roger Bacon’s Cypher: The Right Key Found. Rochester, NY.

William F. Friedman and Elizebeth S. Friedman. 1959. Acrostics, anagrams, and Chaucer. Philological Quarterly, 38(1):1–20.

Bradley Hauer, Ryan Hayward, and Grzegorz Kondrak. 2014. Solving substitution ciphers with combined language models. In COLING, pages 2314–2325.

Grzegorz Jaskiewicz. 2011.
Analysis of letter frequency distribution in the Voynich manuscript. In International Workshop on Concurrency, Specification and Programming (CS&P’11), pages 250–261.

Frederick Jelinek and Robert L. Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data. In Pattern Recognition in Practice.

Gerry Kennedy and Rob Churchill. 2006. The Voynich manuscript: The mysterious code that has defied interpretation for centuries. Inner Traditions/Bear & Co.

Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. 2006. Unsupervised analysis for decipherment problems. In COLING/ACL, pages 499–506.

Kevin Knight, Beáta Megyesi, and Christiane Schaefer. 2011. The Copiale cipher. In 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pages 2–9.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86.

Gabriel Landini. 2001. Evidence of linguistic structure in the Voynich manuscript using spectral analysis. Cryptologia, 25(4):275–295.

John Matthews Manly. 1931. Roger Bacon and the Voynich MS. Speculum, 6(3):345–391.

Marcelo A. Montemurro and Damián H. Zanette. 2013. Keywords and co-occurrence patterns in the Voynich manuscript: An information-theoretic analysis. PLoS ONE, 8(6):e66344.

George Nagy, Sharad Seth, and Kent Einspahr. 1987. Decoding substitution ciphers by means of word matching with application to OCR. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(5):710–715.

William Romaine Newbold and Roland Grubb Kent. 1928. The Cipher of Roger Bacon. University of Pennsylvania Press.

Peter Norvig. 2009. Natural language corpus data. In Toby Segaran and Jeff Hammerbacher, editors, Beautiful Data: The Stories Behind Elegant Data Solutions. O’Reilly.

Malte Nuhn and Kevin Knight. 2014. Cipher type detection. In EMNLP, pages 1769–1773.

Sujith Ravi and Kevin Knight. 2008.
Attacking decipherment problems optimally with low-order n-gram models. In EMNLP, pages 812–819.

Sujith Ravi and Kevin Knight. 2009. Learning phoneme mappings for transliteration without parallel data. In NAACL, pages 37–45.

Sravana Reddy and Kevin Knight. 2011. What we know about the Voynich manuscript. In 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 78–86.

Andrew Robinson. 2002. Lost Languages: The Enigma of the World’s Undeciphered Scripts. McGraw-Hill.

Gordon Rugg. 2004. An elegant hoax? A possible solution to the Voynich manuscript. Cryptologia, 28(1):31–46.

Andreas Schinner. 2007. The Voynich manuscript: Evidence of the hoax hypothesis. Cryptologia, 31(2):95–107.

Klaus Schmeh. 2013. A milestone in Voynich manuscript research: Voynich 100 conference in Monte Porzio Catone, Italy. Cryptologia, 37(3):193–203.

Simon Singh. 2011. The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography. Anchor.

Leonell C. Strong. 1945. Anthony Askham, the author of the Voynich manuscript. Science, 101(2633):608–609.

John Tiltman. 1968. The Voynich Manuscript, “The Most Mysterious Manuscript in the World”. Baltimore Bibliophiles.