key: cord-1048475-sr0sl4bl
authors: Fimmel, Elena; Michel, Christian J.; Pirot, François; Sereni, Jean-Sébastien; Starman, Martin; Strüngmann, Lutz
title: The Relation Between k-Circularity and Circularity of Codes
date: 2020-08-04
journal: Bull Math Biol
DOI: 10.1007/s11538-020-00770-7
sha: 7cdb258890409d1cc83a4cac94c9997a0b6cc1a1
doc_id: 1048475
cord_uid: sr0sl4bl

A code X is k-circular if any concatenation of at most k words from X, when read on a circle, admits exactly one partition into words from X. It is circular if it is k-circular for every integer k. While it is not a priori clear from the definition, there exists, for every pair [Formula: see text] , an integer k such that every k-circular [Formula: see text] -letter code over an alphabet of cardinality n is circular, and we determine the least such integer k for all values of n and [Formula: see text] . The k-circular codes may represent an important evolutionary step between the circular codes, such as the comma-free codes, and the genetic code.

The discovery of the DNA structure by Watson and Crick (1953) spurred a new branch of mathematical biology which overlaps the theory of block codes, i.e. codes consisting of words of a fixed length over some finite alphabet. A relevant concept in this biomathematical context is that of comma-freeness, meaning that code words can be separated without using extra symbols ("commas"). This concept was later weakened to that of a circular code, meaning that the reading frame can always be retrieved in any word written on a circle.

The DNA structure is a sequence of nucleotides on the 4-letter alphabet {A, C, G, T }, where A stands for adenine, C for cytosine, G for guanine and T for thymine, organized in an antiparallel and complementary double helix. A (protein coding) gene is a DNA sequence which is read during the translation process by words of three letters also called trinucleotides or codons. The genetic code is a map between the 64 possible codons and the 20 amino acids constituting the proteins and the three stop codons. Soon after this discovery, scientists believed that the redundancy in the codon amino acid assignment must be some kind of code used by nature for error detection. Crick et al. (2003) proposed that in such a code, no codon can be obtained by concatenating a nonempty suffix and a nonempty prefix of codons in the code. It follows that a frameshift during translation would be detected immediately. This is the definition of comma-free codes.

One year later, Golomb et al. (1958a, b) introduced a mathematical definition of comma-free codes as block codes. One of their discoveries showed that a commafree code on a 4-letter alphabet with words of three letters is maximal if the code contains exactly 20 of the 64 trinucleotides. In other words, there is no comma-free code with 21 and more trinucleotides. The interesting fact that the maximal number of codons in a comma-free code is equal to the number of amino acids motivated theoretical researchers all over the world in the late 1950s to solve the gene coding.

Only a short period later, at the beginning of the 1960s, it was finally proven that the genetic code cannot be a comma-free code as the trinucleotide TTT, an excluded trinucleotide in a comma-free code, codes an amino acid: the phenylalanine (Nirenberg and Matthaei 1961) . Within the following years, comma-free codes gained almost no attention in biology and were rather studied from the point of view of coding and information theory; the work by Lam (2003) entitled: "Completing comma-free codes" seemed to uncover the last relevant properties of comma-free codes.

Before Lam's work, Arquès and Michel (1996) discovered after a statistical computation of the 64 trinucleotides in each of the three frames of genes of bacteria and eukaryotes (2 · 3 · 64 = 384 trinucleotides analysed) that a simple inspection identifies 20 codons which occur preferentially in reading frames (compared to the two shifted frames). Furthermore, this set X of 20 codons forms a circular code: The trinucleotides underlined in blue belong to X , and the trinucleotides underlined in red do not belong to X (Color figure online) Circular codes come from the same family as comma-free codes, but are slightly less restrictive (Berstel et al. 2005, p. 233) . They enable to discover frameshifts not immediately, but after the reading of a finite (bounded) number of codons (called letter window of the circular code). Thus, they may have an important function of frame retrieval during the translational process in the ribosome. Figure 1 illustrates that in any sequence of trinucleotides from X , a frameshift of one or two positions will be detected.

The set X identified in genes has additional strong properties: it is also selfcomplementary and C 3 (Arquès and Michel 1996) . A code is self-complementary if each codon in the code is associated with its anticodon in the code. A circular code is C 3 if the two codes resulting from the two circular permutations of all codons of the code are also circular (reviews in Michel 2008; Fimmel and Strüngmann 2018 for further details on these properties).

Three approaches were developed to study circular codes (classes, hierarchies: (strong comma-free, comma-free, circular), numbers, properties of their prefixes and suffixes, window length to retrieve the word decomposition, etc.): (i) the flower automaton (Arquès and Michel 1996) ; (ii) the necklaces L DN (letter diletter necklace) and DL N (diletter letter necklace) (Pirillo 2003; Michel et al. 2008a, b) extended to (n +1)L DCC N (letter diletter continued closed necklaces) (Michel and Pirillo 2010) ; and (iii) the recent and powerful approach based on graph theory proposed by Fimmel et al. (2016) , which allows to characterize the strong comma-free codes (Fimmel et al. 2017b) , the comma-free codes and the circular codes according to the properties of their associated graphs. The graph associated with a code is directed, and depending on its paths, the class of the code can be deduced. As we shall see, even the minimum length of a sequence needed to ensure the discovery of a frameshift can be efficiently deduced from this graph.

The study of the biological function of circular codes in genes is progressing. Firstly, the circular code X (1.1) is also identified in genes of archaea, plasmids and viruses, in addition to bacteria and eukaryotes, and by two different statistical approaches (Michel 2015 (Michel , 2017 . Furthermore, motifs from the circular code X , called X -motifs, are significantly enriched in the genes of most organisms, from bacteria to eukaryotes , and also exist in the ribosomal and transfer RNAs (rRNAs and tRNAs) (Michel 2012 (Michel , 2013 Michel 2014, 2015; Michel and Thompson 2020) . However, these X -motifs in genes are separated by words which are not in the code X . In order to find a theory which may give a solution to this problem, this work aims at the class of k-circular codes which is another code family of block codes. The k-circular codes are even less restrictive than circular codes and can potentially explain the gaps between the X -motifs in sequences.

This work deals with the general definition of circular codes followed by the boundaries of such. First, it is not evident from the definition itself that such a boundary exists. However, from previous works (Fimmel et al. 2016 (Fimmel et al. , 2017a , one realizes that checking whether a given code is circular can be done efficiently by building the associated graph (see Definition 3.1) and checking whether it contains a cycle. Such an argument traces back a potentially infinite task to a finite one. We delve more into the properties of a noncircular code that are captured by the cycles of its associated graph and uncover a link to the k-circularity of the code. Specifically, we determine precisely, for every finite alphabet and every word length , the existence of an integer k such that every k-circular -letter code over must also be circular. Further, we determine the exact boundary between k-circularity and circularity by finding the least such integer k.

Section 2 provides formal definitions and notations. We establish in Sect. 3 a graph characterization of the notion of k-circularity with Theorem 3.3. Section 4 states our main results: Theorems 4.1 and 4.4. The Existence Theorem 4.1 is demonstrated with a proof based on Theorem 3.3. Section 5 is devoted to prove the Sharpness Theorem 4.4. In Sect. 6, we propose an evolutionary hypothesis for the k-circular codes in gene translation. In particular, we prove that there are exactly 52 maximal 1-circular codes that encode all 20 amino acids, while the 12,964,440 maximal circular codes encode at most 18 amino acids.

Let be an arbitrary finite alphabet of cardinality at least 2 and set n := | |. To emphasise the special important case of the genetic alphabet, which has cardinality 4, we set B := {A, C, G, T }. We use standard word-theory notation: * := ≥0 . In other words, * is the set of all finite words with letters in , including the empty word.

If w = w 1 · · · w ∈ for some ∈ N, then for every j ∈ {0, . . . , − 1}, the circular j-shift of w is the word w j+1 · · · w w 1 · · · w j . In particular, the circular 0shift of w is w itself. A word w is a circular shift of w if w is the circular j-shift of w for some j ∈ {0, . . . , − 1}. If w = w 1 · · · w ∈ , then the concatenation of w and w is the word w · w := w 1 · · · w w 1 · · · w ∈ + .

Suppose that w = w 1 · w 2 . If none of w 1 and w 2 is the empty word, then w 1 is a proper prefix of w and w 2 is a proper suffix of w.

Definition 2.1 Let be a finite alphabet.

• For an integer ≥ 2, an -letter code is a subset of .

• For w ∈ * and X ⊆ * , an X -decomposition of w is a tuple (x 1 , . . . , x t ) ∈ X t with t ∈ N such that w = x 1 · x 2 · · · x t .

We now formally define the notion of circularity of a code.

Definition 2.2 Let X ⊆ be an -letter code.

• Let m be a positive integer, and let (c 1 , . . . , c m ) ∈ X m . A circular Xdecomposition of the concatenation c := c 1 · · · c m is an X -decomposition of a circular shift of c. • Let k be a nonnegative integer. The code X is k-circular if for every m ∈ {1, . . . , k} and every m-tuple (c 1 , . . . , c m ) of words in X , the concatenation c 1 · · · c m admits a unique circular X -decomposition. Note that every code is trivially 0-circular.

Several examples of codes that are k-circular but not (k + 1)-circular will be given in Example 4.5, but to illustrate Definition 2.2, we state one example here.

Let be the binary alphabet {0, 1}. Then, X = {0001, 0111, 1100} is a 2-circular but not 3-circular 4-letter code. Indeed, computer calculations show that X is 2-circular but it is easy to see that the code X is not 3-circular since the word w = 0111 · 0001 · 1100 has a second circular X -decomposition: that of its 2-shift 1100 · 0111 · 0001.

We will make use of graph theory to study the circularity of codes. To this end, we establish in the next section a graph characterization of k-circularity.

A new graph approach for studying circular codes (cf. Definition 3.1) has been recently suggested (Fimmel et al. 2016) . It (nontrivially) generalises an older approach used for 2-letter words (Ball and Cummings 1976) . The original main results were formulated only for the genetic alphabet (Fimmel et al. 2016 ) but they could be easily extended to any letter words on any finite alphabet (Fimmel et al. 2017a ). Thus, an -letter code is circular if and only if its associated graph is acyclic. Let us define the graph associated with a code.

Definition 3.1 Let be an integer greater than 1, and let X ⊆ be an -letter code. We define a graph G(X ) = (V (X ), E(X )) with set of vertices V (X ) and set of arcs E(X ) as follows:

• V (X ) is composed of all proper prefixes and all proper suffixes of words in X ,

The graph G(X ) is the graph associated with X . For each i ∈ {1, . . . , }, the vertices of G(X ) that correspond to words of length i are referred to as i-nodes. (1) For every j ∈ {1, . . . , /2 }, the j-component of G(X ) contains at most n j +n − j vertices if j = /2 and at most n /2 otherwise. (2) If is odd, then G(X ) contains no directed cycle of odd length.

(3) If is even, then every directed cycle of odd length in G(X ) is contained in the ( /2)-component. (4) Suppose that G(X ) contains a directed cycle of length t, and let j ∈ {1, . . . , /2 } such that this cycle is contained in the j-component. Then,

The k-circularity of dinucleotide codes was characterized in terms of graphs (Fimmel et al. 2016 ):

• a dinucleotide code is 1-circular but not 2-circular if and only if its associated graph contains a Hamiltonian cycle of length 4; • a dinucleotide code is 2-circular but not 3-circular if and only if its associated graph contains an oriented cycle of length 3 and has no Hamiltonian cycle; and • a dinucleotide 3-circular code is circular.

Our next theorem is a natural generalization of this approach. We need the following notation. If X is a code then g o (X ) and g e (X ) are the respective lengths of the shortest directed cycles of odd length and of even length in G(X ). We define g o (X ) := ∞ and g e (X ) := ∞ if cycles of odd length and of even length do not exist in G(X ), respectively. If the code X is not circular, then one of these numbers must be finite. Theorem 3.3 Let be a finite alphabet, an integer greater than 1 and X ⊆ an -letter code over . Then, the code X is k-circular and not (k + 1)-circular if and only if

Proof Suppose first that w 1 · · · w r is a directed cycle in G(X ). If r is even, then the word w 1 · · · w r admits two different circular X -decompositions into r /2 words from X , namely

It follows that X is not 1 2 g e (X )-circular. If r is odd, then the word w 1 · · · w r w 1 · · · w r admits two different circular X -decompositions into r words from X , namely

It follows that X is not g o (X )-circular.

Conversely, suppose that both w and its circular j-shift admit a circular Xdecomposition, where j ∈ {1, . . . , − 1}. By setting a r +1 := a 1 , the word w can be written a 1 b 1 | . . . |a r b r such that |a i | = j, |b i | = − j, and a i b i , b i a i+1 ∈ X for each i ∈ {1, . . . , r }. It follows from Definition 3.1 that

is a closed walk in G(X ). Consequently, W either contains a directed cycle of even length, which then must be of length at most |V (W )| = 2r , or W decomposes into an even number of directed odd cycles, one of them thus having length at most r . Consequently, if X is not r -circular, then g e (X ) ≤ 2r or g o (X ) ≤ r . The conclusion follows.

The results from Fimmel et al. (2016) are consistent with the statement of Theorem 3.3.

• If X is a 2-letter code that is 1-circular but not 2-circular, then G(X ) contains cycles of lengths 3 and 4, and indeed 1 = min 3, 4 2 − 1. • If X is a 2-letter code that is 2-circular but not 3-circular, then G(X ) contains a cycle of length 3 and no cycle of even length, and indeed 2 = min{3, ∞} − 1.

Theorem 3.3 implies, in particular, that if a code X is not circular, then the graph G(X ) contains a cycle. By Observation 3.2, the length of a cycle in G(X ) is bounded by a function f that depends only on the cardinality n of the alphabet and the length of the words in X . This implies the existence of a barrier k = k(n, ) such that k-circularity implies circularity for every -letter code over an alphabet of cardinality n, for all values of and n. In the next section, we will state our main theorems that determine completely this barrier.

The main purpose of this section is to study when k-circularity implies circularity. It is easy to see that these two notions are not equivalent, the notion of k-circularity being weaker than that of circularity (see and the examples below). Thus, it is natural to ask, given integers n and both at least 2, what is the least integer k(n, ) ∈ N such that every code X ⊆ that is k(n, )-circular is circular. (The cases where n or is 1 are trivial.) We provide a full answer to this question in our main results Theorem 4.1 (existence (and value) of a bound) and Theorem 4.4 (sharpness of the bound given).

Let be an alphabet with | | = n, and X ⊆ an -letter code over . Set

if is even and n is odd, n /2 − 1 if is even and n is even.

Before we proceed with the proof of Theorem 4.1, let us note that it readily implies the following results with dinucleotide and trinucleotide circular codes over the genetic alphabet, which were established previously Strüngmann 2015, 2016) .

Proof (1) In this case n = 4 and = 3, and hence according to Theorem 4.1, the code X is circular if and only if it is 4 3−1 2 = 4 1 = 4-circular.

(2) In this case n = 4 and = 2, and hence according to Theorem 4.1, the code X is circular if and only if it is (4 2 2 − 1) = 3-circular.

We also apply Theorem 4.1 to the case of a binary alphabet, i.e. with n = 2, and obtain the following statement.

(1) 2 −1 2 -circular if is odd; or (2) (2 2 − 1)-circular if is even.

One direction is trivial: if X is circular, then X is r -circular for every integer r , and hence for k(n, ).

Conversely, let X be an -letter code that is r -circular but not (r + 1)-circular. By Theorem 3.3, the graph G(X ) contains a cycle of (even) length 2(r + 1), or r is even and G(X ) contains a cycle of (odd) length r + 1.

(i) If is odd, then Observation 3.2 yields that g o (X ) = ∞, and therefore in this case r + 1 = 1 2 g e (X ) < g o (X ). Because is odd, Observation 3.2 implies that

Consequently, r + 1 ≤ n ( −1)/2 and hence r ≤ k(n, ) − 1.

(ii) Assume now that is even. Note that if r +1 = 1 2 g e (X ), then by Observation 3.2,

n /2 − 1 ifn is odd (and is even), n /2 − 2 ifn is even (and is even).

We established that whenever

if is even and n is odd, n /2 − 1 if is even and n is even, every -letter code over an alphabet of cardinality n that is r -circular must be circular. This concludes the proof.

The immediate question that remains open now is whether or not the bounds given in the Existence Theorem 4.1 are sharp. This is indeed the case as our next theorem asserts. The proof takes several pages, and we defer it to Sect. 5.

Given two integers n and both at least 2, let be an alphabet with | | = n and X ⊆ an -letter code over . Then, k(n, ) is the least integer r such that every code X ⊆ that is r -circular is circular.

For the convenience of the reader, we close this section by giving explicitly codes that are (k(n, )− 1)-circular but not circular for some specific values of n and . They all come from the constructions presented in the proof of Theorem 4.4. These codes are the smallest existing ones, and they are also minimal regarding Theorem 3.3, as we will explain at the beginning of Sect. 5. Further examples for the case where is odd and n ≥ 3 are given at the end of Sect. 5 after the general construction is explained.

Example 4.5 (1) Let n = 2 and = 3. Every 2-circular 3-letter code is circular, and there are 1-but not 2-circular 3-letter codes:

(2) Let n = 2 and = 4. Every 3-circular 4-letter code is circular, and there are 2but not 3-circular 4-letter codes over {0, 1}:

(3) Let n = 2 and = 5. Every 4-circular 5-letter code is circular, and there are 3but not 4-circular 5-letter codes:

(4) Let n = 2 and = 7. Every 8-circular 7-letter code is circular, and there are 7but not 8-circular 8-letter codes:

X (2, 7) = {0001000, 0010010, 0101100, 0111110, 1000001, 1011011, 1100101, 1110111}.

(5) Let n = 3 and = 4. Every 9-circular 4-letter code is circular, and there are 8but not 9-circular 4-letter codes over {0, 1, 2}:

, 0102, 0210, 1011, 1120, 2021, 2122, 2200}.

(6) Let n = 4 and = 3 (that is, trinucleotides over the genetic alphabet). Every 4-circular 3-letter code is circular, and there are 3-but not 4-circular 3-letter code over B:

(7) Let n = 4 and = 4 (that is, tetranucleotides over the genetic alphabet). Every 15-circular 4-letter code is circular, and there are 14-but not 15-circular 4-letter codes over B:

In this section, we prove the Sharpness Theorem 4.4, i.e. we prove that k(n, ) is the least integer such that every code X ⊆ that is k(n, )-circular is circular. For all integers n and both at least 2, we shall construct a code that is (k(n, ) − 1)circular but not circular. In order to do so, we construct codes such that the graphs associated with them contain a unique cycle, of length k(n, ) if k(n, ) is odd and, as required, of length 2k(n, ) otherwise. Theorem 3.3 guarantees then that such a code is (k(n, ) − 1)-circular but not circular. We break the demonstration into three cases: when is even (Lemma 5.1) and when is odd (Lemma 5.2). The case when is odd is again split into two cases: when is odd and n = 2 (Lemma I.1) and finally when is odd and n ≥ 3 (Lemma I.2). The proofs of Lemmas I.1 and I.2 are quite long and are therefore contained in Appendix 1. However, there are new mathematical tools for analysing codes, which may identify additional properties in the genetic code in the future. We indicate here the main idea of the constructions. We already would like to point out that, although the approaches to establish Lemmas I.1 and I.2 are similar, there are some differences. It seems therefore more natural to separate the binary case, which also helps to comprehend the strategy in a more restricted case. Given an integer n ≥ 2, let n = {0, . . . , n − 1} be the alphabet of cardinality n consisting of the first n integers. We first deal with the case when is even.

Lemma 5.1 If n and are integers both at least 2 and is even, then there is an -letter code over that is (k(n, ) − 1)-circular but not circular where k(n, ) := n /2 if n is odd and k(n, ) := n /2 − 1 if n is even.

Proof Note that k(n, ) is always odd in this setting. The situation where n = 2 = is trivial: every code is 0-circular, and the code {00} is not 1-circular.

We now assume that (n, ) = (2, 2). We construct an -letter code X (n, ) over n such that G(X (n, )) contains a unique cycle, which is of odd length k(n, ). It thus follows that X (n, ) is (k(n, ) − 1)-circular but not k(n, )-circular by Theorem 3.3. The code is constructed by using n-adic representations of numbers below n /2 ; see Eq. (5.1).

We first rule out a boundary case: if n = 2 and = 4, then one verifies directly that X (2, 4) := {0001, 1100, 0111} is 2-circular but not 3-circular. Indeed, the associated graph G(X (2, 4)), depicted in Fig. 3 , contains a unique directed cycle, which has length 3.

From now on, we assume that (n, ) = (2, 4), so that n ≥ 3 or ≥ 6. Given a length i ≤ /2 and an integer x < n i , we define (x) i to be the word of length i over the alphabet n representing x written in basis n. For example, if n = 3, then (8) 4 = 0022. If w = (x) i , then we also write x = w . To improve readability, we use Y and Z to, respectively, refer to n − 2 and n − 1 when writing words in * n . Let X (n, ) be defined as where the addition is in Z k(n, ) . Note that |X | = k(n, ). For instance, if n = 2 and = 6, then k(n, ) = n /2 − 1 = 7, and X (2, 6) = {000001, 001010, 010011, 011100, 100101, 101110, 110000}.

We assert that the graph G(X (n, )) associated with X (n, ) has a unique cycle, of odd length k(n, ). It then follows from Theorem 3.3 that the -letter code X (n, ) is (k(n, ) − 1)-circular but not k(n, )-circular.

We begin by proving a useful assertion on the code X − (n, ) defined as Proof For each i ∈ {1, . . . , /2}, let C i be the i-component of G − . Note that C /2 is a path of length k(n, )−1, traversing all nodes (x) /2 in increasing order, for x ∈ Z k(n, ) . It follows that C /2 is acyclic. We now fix i ∈ {1, . . . , /2 − 1} and focus our attention on the component C i , which has no odd cycle. Let y = y 1 · y 2 · y 3 be an

(1) By the definition of G − , if there is an arc from a node x to y, then x is an i-node and it holds in Z k(n, ) that y 2 · y 3 = x · y 1 + 1.

There are two possible cases:

(a) y 2 = x and y 3 = y 1 + 1; or (b) y 2 = x + 1, y 1 = Z /2−i and y 3 = 0 /2−i .

Note that y 1 = y 3 in both cases.

(2) On the other hand, if there is an arc from y to a node x , then x is an i-node and it holds in Z k(n, ) that y 3 · x = y 1 · y 2 + 1.

Again there are two possible cases:

(a) y 3 = y 1 and x = y 2 + 1; or (b) y 3 = y 1 + 1, y 2 = Z i and x = 0 i .

Let us assume that y has both an in-going arc x → y and an outgoing arc y → x , where it might be that x = x . Because of the arc x → y, we know that y 1 = y 3 , so in particular Case (2.a) is impossible; hence, we must fall into Case (2.b). Therefore, x = 0 i and y 3 = y 1 + 1, and we infer that the arc x → y fell into case (1.a), so x = y 2 = Z i . This proves the "moreover" part of the assertion. It directly follows that 0 i does not belong to a directed cycle in C i , and consequently no ( − i)-node belongs to a directed cycle in C i . Because all arcs in C i are between an i-node and an ( − i)-node, we deduce that C i is acyclic, thereby concluding the proof of (A).

We now show that G(X ) contains exactly one cycle, which spans its whole component C /2 . The code X is obtained from X − by adding the word Z /2−1 Y 0 /2 if n is even, or the word Z /2 0 /2 if n is odd. Let us see how G(X ) is obtained from G − . As noted earlier, the /2-component of G − is a path of length n /2 − 1 traversing all nodes (x) /2 for x ∈ Z k(n, ) in increasing order. Because the /2-component of G(X ) is obtained from that of G − by adding an arc from (k(n, ) − 1) /2 to (0) /2 , we precisely obtain a cycle of length k(n, ). For each i ∈ {1, . . . , /2 − 1}, the component C i is obtained from the i-component of G − by adding an arc outgoing from the i-node Z i and an arc in-going to the i-node 0 i . This creates no cycle of length at least 4, since G − contains no directed path from 0 i to Z i . The only cycles that might be created are therefore of length 2, and there are two possible ones. The first possibility is to create a cycle containing precisely the i-node 0 i and either the ( − i)node Z /2−1 Y 0 /2−i if n is even, or the ( − i)-node Z /2 0 /2−i if n is odd. If this cycle is created, then

This is possible only if i = 1, and either Y = 1 and n is even, or Z = 1 and n is odd. However, the parity of n is the same as that of Y = n − 2, and different from that of Z = n − 1, so this first possible cycle is not created. The other possible cycle is the one containing the i-node Z i and either the ( − i)node Z /2−i−1 Y 0 /2 if n is even, or the ( − i)-node Z /2−i 0 /2 if n is odd. If this cycle is created, then

If n is odd, then the equality implies that i = /2, which is not the case. If n is even, then the equality implies that /2 − i = 1 and Y = 0, that is, n = 2. It would follow that 01 i = 0 i+1 + 1, implying i = 1 and hence = 4, which is not the case as we assumed that (n, ) = (2, 4) . This completes the proof of Lemma 5.1. We now proceed with the case where is odd. The full proof is split into two cases given in detail in Appendix 1 (Lemmas I.1 and I.2). Here, we indicate the strategy and main steps of the proof.

Lemma 5.2 Let n and be integers both at least 2 and assume that is odd. There is an -letter code over n that is (n ( −1)/2 − 1)-circular but not circular.

Let us first deal with the case where is odd and n = 2. We have to show that there is an -letter code over n that is (2 ( −1)/2 − 1)-circular but not circular. Let us fix k := k(2, ) = 2 −1 2 . The aim is to construct a binary code X such that its graph G(X ) contains a unique cycle, of length 2k. Let s := −3 2 , S ⊆ {0, 1} s be a subset of binary words of length s and set

where for a binary word w, the complement of w is the word w obtained from w by complementing each of its letters. It will be shown in Lemma I.1 (Appendix 1) that all but one components of G(X ) are acyclic if S ⊆ 01{0, 1} s−2 and that there is a choice for such a set S so that the remaining component consists of exactly one cycle, which solves the binary case. The case where is odd and n > 2 is more delicate. As before, one sets s := −3 2 and k := n −1 2 . We shall define in Lemma I.2 (Appendix 1) a mapping ϕ from s n to the family of 3-letter codes over n with very specific properties so that the associated code X ϕ := a · y · b · y · c : y ∈ s n and abc ∈ ϕ(y) satisfies our requirements. All the details for the construction of ϕ can be found in Lemma I.2 but we would like to notice that in the binary case, we have ϕ(y) = {010, 101} if y / ∈ S and ϕ(y) = {000, 111} otherwise.

Lemmas 5.1 and 5.2 together end the proof of our Sharpness Theorem 4.4.

Almost all living organisms use the same standard genetic code (SGC) to translate 64 trinucleotides (codons) into 20 amino acids and the stop signal. Many hypotheses have been proposed to explain the origin of the genetic code (e.g. reviewed in Koonin 2017; Koonin and Novozhilov 2009) , including the frozen accident theory stating that the genetic code was created randomly and stayed frozen ever since, the stereochemical theory based on stereochemical relationships between amino acids and specific codons (Pelc and Welton 1966; Yarus 2017) , the adaptive theory suggesting that the genetic code was shaped to be maximally robust (Woese 1965; Freeland and Hurst 1998) , and the coevolution theory of the genetic code with amino acid biosynthetic pathways (Wong 1975) . However, it is likely that all these models combined to play a part in the evolution of the genetic code.

The adaptive theory proposes that the genetic code was optimized to minimize the effects of errors during transcription and translation. The most common source of errors, known as missense errors, is the incorrect reading of a codon and the resulting incorporation of the wrong amino acid. From a theoretical point of view, we will show below the existence of a hierarchy of circular codes and k-circular codes allowing a codon decoding without ambiguity, from strong to flexible reading frame constraints. Of course, this model does not exclude the possibility of evolutionary phases with reading frame errors, such as the ribosomal frameshifts that are observed, for example, in the genes of today's viruses (to cite a topical example, the -1 ribosomal frameshift in the gene orf1ab of SARS-CoV-2, NCBI identification MT072688).

Circular codes could have operated in the primitive soup for constructing the modern standard genetic code. Figure 4 proposes an evolutionary hypothesis of the genetic code based on a growing combinatorial hierarchy of k-circular trinucleotide codes where k ∈ {1, 2, 3, 4} (see Theorem 4.1 with n = 4 and = 3, and Corollary 4.2). Evolution would have started with the circular (4-circular) trinucleotide codes X p with an increasing complexity according to the maximal path length p (from 1 to 8) in their associated graph G(X p ). As the maximal path length p is related to the window nucleotide length of reading frame retrieval, the circular codes X 1 (strong commafree) and X 2 (comma-free) are more constrained than those in X 8 . The maximal C 3self-complementary trinucleotide circular code X observed in genes (1.1) belongs to the class X 8 . Then, evolution continued with the three classes of k-circular codes, where k ∈ {1, 2, 3}, which are less constrained than the classes of circular codes. As such, these three classes with a partial circularity property could represent an intermediate step in the code evolution between circular codes which have a complete circularity property, and the extant genetic code SGC where the circularity property is totally lost (Fig. 4) . As a consequence, the circular code motifs, in particular the Xmotifs of the circular code X (1.1) which are significantly enriched in the genes of most organisms , always retrieve the reading frame. In contrast, the k-circular code motifs may not always find out the reading frame.

This code evolution is also supported by the number of amino acids which are coded by the circular codes. There are 12,964,440 maximal circular codes. No maximal circular code among these 12,964,440 ones codes for 20 or 19 amino acids with SGC. Ten maximal circular codes code for 18 amino acids with SGC (see Michel and Pirillo 2013, Introduction) . Interestingly, we identify 52 maximal 1-circular codes among 3 20 = 3,486,784,401 ones, i.e. with a very low probability ≈ 10 −8 , which code for the 20 amino acids (see the list given in Appendix 2 or Michel (2014) , Table 2 where they were called bijective genetic codes without permuted trinucleotides WPTBGC before the theory of k-circular codes was developed in this work). It has already been verified that these 52 maximal 1-circular codes cannot be 2-circular. 1 Moreover, the following construction based on graph theory underlines once more the distinguished role of this class of 52 maximal 1circular codes and gives a theoretical framework for earlier computer results (Michel 2014) .

A bipartite graph is a graph G = (V , E) such that the set of vertices V is the disjoint union of two sets A and B such that any edge in e ∈ E is of the form e = (a, b) for some a ∈ A and b ∈ B, i.e. no edge connects two vertices both from A or two vertices both from B. A perfect matching of G is a subset M ⊆ E such that the edges of M form a bijection between the sets A and B. A perfect matching can thus exist only if A and B have the same cardinality. We are now ready to prove the existence of the 52 maximal 1-circular codes that encode all 20 amino acids. Lemma 6.1 There are exactly 52 maximal 1-circular codes that encode all 20 amino acids.

We first need to recall some facts . An equivalence class [c] of some codon c ∈ B 3 consists of c and its circular shifts, e.g. Moreover, the set of edges E consists of all pairs (D i , aa) such that there is a codon in D i that encodes the amino acid aa. It is now easy to see that the maximal 1-circular codes that encode all 20 amino acids are in bijective correspondence with the perfect matchings of the constructed graph G. An application of established algorithms for calculating perfect matchings of a graph, e.g. the Hungarian algorithm, now yields the list of 52 perfect matchings of G. This completes the proof.

During this work, a new relation came to light between the combinatorial hierarchy of circular codes (Fig. 4) and their probability measure of reading frame coding R F R (or reading frame retrieval) within two successive codons ( Fig. 6 ; see the method in Michel 2014 for detail). This R F R measure ranges from 1/3 (one chance out of three to retrieve the reading frame among the three possible frames in genes) with the random codes, e.g. B 3 , to 1 (the reading frame is always retrieved) with the strong comma-free codes and the comma-free codes. Remember that: (i) the maximal size of strong comma-free codes cannot exceed 9 trinucleotides, and there are only 8 codes belonging to this class; and (ii) there are 408 maximal (size of 20 trinucleotides) comma-free codes. The 12,964,440 maximal circular codes have R F R values in the Thus, they have an ability to retrieve the reading frame weaker than the maximal circular code of the lowest R F R value 72.7 (%). On the other hand, the genetic code B 3 , obviously not circular, has a R F R probability of course equal to 1/3, as with all random codes (depicted in Fig. 6 ). The 4 unitary codes {A A A}, {CCC}, {GGG} and {T T T }, which are not circular, are also random codes with a R F R probability equal to 1/3. It is very interesting to point out from an amino acid coding point of view that the loss of the reading frame in a sequence of such a unitary code does not lead to the coding of an amino acid different from the one coded in the reading frame, in contrast to the 60 remaining unitary codes which are circular and comma-free (48 strong comma-free). In summary, the growing combinatorial hierarchy of k-circular trinucleotide codes is associated with a decreasing probability hierarchy of reading frame coding R F R.

The main property of circular codes which has been reported since 1996 is the nucleotide window length for retrieving the reading frame, e.g. 13 nucleotides with the circular code X in genes. The relation identified above shows that, from our point of view, the circular codes may retrieve the reading frame in genes according to two properties (property (i) being classic in coding theory, property (ii) being a new proposition): (i) always retrieved, i.e. without error, using a nucleotide window length, but then with a slow process; and (ii) retrieved with high frequencies, i.e. not always as some errors may occur, within two successive codons, thus with a fast process.

In the present work, we found for all possible sizes n of a given finite alphabet and for all word lengths , the (sharp) boundary k(n, ) from which the k(n, )-circularity of an -letter code X over implies its circularity. This result is important: obviously from a mathematical point of view, and also from a biological perspective.

The theoretical work developed here opens several biological research themes which can be investigated in the future: a classification of genes according to circular and k-circular code motifs as functionally conserved ("ancestral") genes may contain more circular code motifs compared to functionally specific genes; a construction of phylogenetic trees and alignments using these classes of motifs; a localization of circular and k-circular code motifs in a gene as the circular code motifs may be associated with regions in a gene where the ribosomal translation accuracy is important, e.g. after or before a splice site in the eukaryotic genes; the amino acid coding may also differ between these different circular code motifs; and many more.

In order to complete the proof of Theorem 4.4, we need to give a detailed proof of Lemma 5.2 where only the main construction steps were indicated. Recall that, given an integer n ≥ 2, we have n = {0, . . . , n − 1}.

Lemma I.1 Let be an odd integer greater than 2. There is an -letter code over 2 that is (2 ( −1)/2 − 1)-circular but not circular.

Proof Set k := k(2, ) = 2 −1 2 . As mentioned in Lemma 5.2, we show that there exists a binary code X that is (k −1)-circular but not circular, and more specifically that there exists a code X such that G(X ) contains a unique cycle, of length 2k. For a ∈ {0, 1}, we set a := 1 − a and we call a the complement of a. If w is a binary word, the complement of w is the word w obtained from w by complementing each of its letters.

Let s := −3 2 , and let S ⊆ {0, 1} s be a given subset of binary words of length s. We construct the code X S associated with S as mentioned in the sketch of the proof of Lemma 5.2 given earlier:

We observe that if ∈ {3, 5} (and hence s ∈ {0, 1}), then the code X ∅ is indeed (k − 1)-circular but not k-circular, for instance by applying Theorem 3.3.

In the sequel, we prove that if S ⊆ 01{0, 1} s−2 , then all but one components of G(X S ) are acyclic and that there is a choice of S such that the remaining component consists of one cycle only. This then yields the desired result. We thus assume from now on that ≥ 7, and hence s ≥ 2. Proof Recall that /2 = s + 1. In the middle component C s+1 of G(X ∅ ), every possible (s + 1)-node a · y · a with a, a ∈ {0, 1} and y ∈ {0, 1} s−1 is present and has exactly one in-going arc, which comes from the (s + 2)-node a · a · y · a , and one outgoing arc, which goes to the (s + 2)-node a · y · a · a. On the other hand, every (s + 2)-node of the form a · y · a (so with complementary first and last letters and where |y| = s) is present and has exactly one in-going arc, which comes from the (s + 1)-node a · y, and one outgoing arc, which goes to the (s + 1)-node y · a. So the middle component is a 2-regular digraph (that is, 1-out-regular and 1-in-regular), and hence a union of cycles of total length 2 × 2 s+1 = 2k.

Let i ∈ {1, . . . , s}. Suppose, contrary to the statement, that v is an ( − i)-node in C i that is neither a source nor sink. It follows that C i contains an arc e = u → v and an arc e = v → u , where u and u are i-nodes. The arc e corresponds to a word a · y · a · y · a with |y| = s. Let us write y = y 1 · y 2 with |y 1 | = i − 1, so that u = a · y 1 and v = y 2 · a · y 1 · y 2 · a. Similarly, the arc e corresponds to a word a · y · a · y · a with y = s. Writing y = y 1 · y 2 with y 2 = i − 1, we obtain u = y 2 · a and v = a · y 1 · y 2 · a · y 1 . Consequently, y 2 · a · y 1 · y 2 · a = a · y 1 · y 2 · a · y 1 , so in particular y 2 · a = a · y 1 and y 2 · a = a · y 1 . This yields the double contradiction that the first letter of y 2 should be equal both to a and to a and that the last letter of y 1 should be equal both to a and to a. This ends the proof of (B).

(C). If S ⊆ 01{0, 1} s−2 , then all components of G(X S ) but the middle one are acyclic.

Proof Let S ⊆ 01{0, 1} s−2 and consider the component C i of G(X S ), where i ∈ {1, . . . , s}. Suppose that C i contains a path u → v → u of length 2 where u and u are i-nodes (possibly equal), and hence v is an ( − i)-node. By (B), at least one of these two arcs comes from a word of the form a · y · a · y · a with y ∈ S and a ∈ {0, 1}. Proceeding exactly as in the proof of (B), we infer that, actually, both arcs come from such words. So let (y, y ) ∈ S 2 such that u → v corresponds to the word a · y · a · y · a and v → u corresponds to the word a · y ·a · y ·a . Write y = y 1 · y 2 and y = y 1 · y 2 with |y 1 | = i − 1 = y 2 . It follows that u = a · y 1 , u = y 2 · a and y 2 · a · y 1 · y 2 · a = v = a · y 1 · y 2 · a · y 1 .

(I.1)

In particular, y 1 = y 2 and hence u = y 1 · a .

(I.2)

We next observe that |y 1 | > 0, that is, i ≥ 2. Indeed, if i = 1, then y 1 , y 2 ∈ S. Since S ⊆ 01{0, 1} s−2 , we know that y 2 · a = a · y 1 (they differ on the second letter), which contradicts (I.1). We now prove that u = 0 i . Suppose first that i ≥ 3. Then |y 1 | ≥ 2 so y 1 starts with 01 as y 1 ·y 2 ∈ S. Therefore, u starts with 01. Suppose now that i = 2. Then y 1 = 0 and y 2 starts with 1. Since y 2 · a = a · y 1 , it follows that a = 1 and hence u = 0 i . Suppose now, for a contradiction, that C i contains a cycle C. Every i-node on C is the last vertex of a path of length 2 (possibly closed) contained in C. Therefore, all i-nodes on C contain a 1. It follows from this that C contains an i-node that starts with 1. Indeed, let u = 0 j 1 · y be an i node on C, with j ≥ 1 and |y| = i − j − 1. Then, by (I.2) the i-node u that is at directed distance 2 from u on C starts with 0 j−1 1. The assertion follows by (finite) induction. However, every i-node on C starts with the prefix of an element of S, and therefore cannot start with a 1. This contradiction concludes the proof of (C).

There exists S ⊆ 01{0, 1} s−2 such that the middle component of G(X S ) is connected. We need the following property to establish (D). (D1) Assume that there exist S ⊆ {0, 1} s and y ∈ {0, 1} s \S such that the middle component of G(X S ) contains two different cycles C 0 and C 1 such that y is a prefix of an (s + 1)-node in C i for each i ∈ {0, 1}. Then, the middle component of G(X S+y ) contains one cycle fewer than that of G(X S ).

Proof Note that C 0 and C 1 are necessarily disjoint. Without loss of generality, assume that y · a is an (s + 1)-node on C a for a ∈ {0, 1}. Because y / ∈ S, it follows from the definition of X S that for each a ∈ {0, 1}, the cycle C a contains the path a · y → a · y · a → y · a. By the definition of X S+y , the only change between the middle components of G(X S ) and G(X S+y ) consists in the replacement of these two paths by a · y → a · y · a → y · a for each a ∈ {0, 1}, see Fig. 7 . This merges C 0 and C 1 into a single cycle containing all i-nodes in C 0 and C 1 and leaves the other cycles in the middle component unchanged. The conclusion of (D1) follows.

We analyse the following algorithm to finish the proof of (D). Given a set S, let Y S be the set of vertices y ∈ {0, 1} s \S satisfying the hypothesis of (D1). For w ∈ {0, 1} s+1 , let C S (w) be the set of all the (s +1)-nodes in the same connected component of G(X S ) as w. In particular, this component is a directed cycle of even length, alternating between (s + 1)-nodes and (s + 2)-nodes. If w 1 and w 2 belong to C S (w), then w 2 is consecutive to w 1 if the (directed) distance from w 1 to w 2 on this directed cycle is 2. The algorithm is as follows.

We start by setting S := ∅. Then, while there exists y in Y S ∩ 01{0, 1} s−2 that is the prefix of a word in C S (0 s+1 ), we add y to S.

Note that the algorithm terminates. From now on, we let S be the set returned by this algorithm. (D2) Set C 0 := C S (0 s+1 ). Assume that each y ∈ 01{0, 1} s−2 appears either twice or never as the prefix of an (s + 1)-node in C 0 . Then C 0 contains all (s + 1)-nodes and hence has length 2k.

Proof For an (s + 1)-node w = a 1 · · · a s+1 ∈ {0, 1} s+1 and an integer j ∈ {0, . . . , s + 1}, we define the complement j-shift of w to be the (s +1)-node a j+1 · · · a s+1 a 1 · · · a j . So w is its own complement 0-shift and the complement w of w is the complement (s + 1)-shift of w. We notice that if w is an (s + 1)-node, then the node consecutive to w on C ∅ (w) is the complement 1-shift of w. It thus follows (by finite induction) that the complement j-shift of w also belongs to C ∅ (w) for each j ∈ {0, . . . , s + 1}.

In particular, C ∅ (w) is stable under taking complements and contains a word starting with 01 for every w ∈ {0, 1} s+1 .

Another useful remark is the one made at the end of the proof of (D1), that C S (x) ⊆ C S+y (x) for every x ∈ {0, 1} s+1 and y ∈ Y S . Consequently,

Since C ∅ (w) is stable under taking complements for every w ∈ {0, 1} s+1 , we deduce from (I.3) that so is C 0 . Therefore if y satisfies the hypothesis of (D2), then so does y, and hence all words in 10{0, 1} s−2 satisfy the hypothesis of (D2).

Let w := (01) (s+1)/2 if s is odd and w := 1(01) s/2 otherwise. Our next goal is to demonstrate that C 0 contains w. From this, we shall deduce that C 0 contains all (s +1)nodes starting with 01, and as a consequence all (s + 1)-words. We prove by induction on j ∈ {0, . . . , s−1} that C 0 contains the word w j , defined to be (01) j/2+1 1 s− j−1 if j is even and 1(01) ( j+1)/2 1 s− j−1 otherwise. First, w 0 ∈ C 0 by (I.3) since w 0 ∈ C ∅ (0 s+1 ). Suppose now that j ≥ 1 and w j−1 ∈ C 0 . Write w j−1 = w j−1 · 1. Because w j−1 is the complement 1-shift of 0 · w j−1 , and C 0 is stable under complement shifts, we know that 0 · w j−1 ∈ C 0 . (Indeed, the complement 1-shift of any (s + 1)-word W is the complement s-shift of the complement of W .) Since w j−1 starts with either 01 or 10, we know by assumption that w j−1 · 0 must belong to C 0 . Consequently, the stability of C 0 under taking complement 1-shifts implies that 1·w j−1 ∈ C 0 . Since w j ∈ {0·w j−1 , 1·w j−1 }, it follows that w j ∈ C 0 , which finishes the induction. We conclude that w belongs to C 0 since w = w s−1 .

A similar argument now allows us to show that C 0 contains all (s + 1)-nodes that start with 01. Indeed, let x ∈ {0, 1} s−1 . We want to show that 01 · x ∈ C 0 . For every j ∈ {1, . . . , s − 1}, let x j be the factor of length s + 1 of w · x that starts at the jth letter. We prove by induction on j ∈ {1, . . . , s − 1} that x j belongs to C 0 . We have seen that x 1 = w belongs to C 0 . Suppose that j ∈ {2, . . . , s − 1} and x j−1 belongs to C 0 . Writing x j−1 = a · x j−1 with a ∈ {0, 1}, the (s + 1)-node consecutive to x j−1 on C 0 is either x j−1 · 0 or x j−1 · 1. Since x j−1 starts with either 01 or 10, we know by assumption that both x j−1 · 0 and x j−1 · 1 belong to C 0 . Since x j is one of these, this finishes the induction. It follows that 01 · x = x s−1 ∈ C 0 , and hence C 0 indeed contains all words in 01{0, 1} s−1 .

We are now ready to prove that C 0 contains all (s + 1)-nodes. Indeed, if w is an (s + 1)-node, then (I.3) implies that C S (w) contains an (s + 1)-node starting with 01 since C ∅ (w) does, and hence C S (w) = C 0 . This concludes the proof of (D2).

It follows from (D2) that the set S constructed by the algorithm satisfies the conclusion of (D). It follows from (C) and (D) that G(X S ) contains a unique cycle, of length 2k. This concludes the proof of Lemma I.1.

Here are the possible sets S provided by the algorithm in the proof of Lemma I.1 when 3 ≤ ≤ 15: 

Lemma I.2 If n and are integers greater than 2 and is odd, then there is an -letter code over n that is (n ( −1)/2 − 1)-circular but not circular.

Proof As before, we set s := ( − 3)/2 and k := k(n, l) = n ( −1)/2 . As mentioned in the proof of Lemma 5.2, we shall define a mapping ϕ from s n to the family of 3-letter codes over n that has some special desired properties. We set X ϕ := a · y · b · y · c : y ∈ s n and abc ∈ ϕ(y) .

Notice that in the binary case, we had ϕ(y) = {010, 101} if y / ∈ S and ϕ(y) = {000, 111} otherwise. Our goal is to define ϕ such that X ϕ is (k − 1)-circular but not circular.

(E). Let a ∈ n and i ∈ {1, . . . , s}.

(1) If abb / ∈ y∈ s n ϕ(y) for every b ∈ n , then for every w ∈ i−1 n the component C i of G(X ϕ ) has no directed path of length 2 starting at a · w.

(2) If bba / ∈ y∈ s n ϕ(y) for every b ∈ n , then for every w ∈ i−1 n the component C i of G(X ϕ ) has no directed path of length 2 ending at a · w.

In particular, if at least one of (1) or (2) holds for every a ∈ n , then C i is acyclic.

Proof Suppose that v is an ( − i)-node in the component C i that is the middle vertex of a directed path of length 2. It follows that C i contains an arc e = u → v and an arc e = v → u , where u and u are i-nodes. The arc e corresponds to a word a · y · b · y · c with y ∈ s n and abc ∈ ϕ(y). Let us write y = y 1 · y 2 with |y 1 | = i − 1, so that u = a · y 1 and v = y 2 · b · y 1 · y 2 · c. Similarly, the arc e corresponds to a word a · y · b · y · c with y ∈ s n and a b c ∈ ϕ(y ). Writing y = y 1 · y 2 with y 2 = i − 1, we obtain u = y 2 · c and v = a · y 1 · y 2 · b · y 1 . Consequently, y 2 · b · y 1 · y 2 · c = a · y 1 · y 2 · b · y 1 , so in particular y 2 · b = a · y 1 and y 2 · c = b · y 1 . This implies that a = b and b = c, which yields both statements of (E).

Let D be a digraph. A vertex with exactly one neighbour in D (which can be either an out-neighbour or an in-neighbour) is a leaf of D. The digraph D is a hairy cycle if it is obtained from a directed cycle C by adding leaves linked to C. (F). If for every y ∈ s n the graph G(ϕ(y)) is a union of hairy cycles such that every 1-letter word belongs to a cycle, then the middle component of G(X ϕ ) is also a union of hairy cycles such that every (s + 1)-letter word belongs to a cycle.

Proof For every (s+2)-word w = a·y·b where |y| = s, the arcs incident to w in G(X ϕ ) are in natural bijection with the arcs incident to ab in G(ϕ(y)). It thus follows by assumption that in G(X ϕ ), the vertex w either is a leaf or w has exactly one in-neighbour and one out-neighbour. Consider now any (s +1)-word w = a · y = y ·a where |y| = s = y . By assumption, w has exactly one out-neighbour b · y · c such that a → bc belongs to a cycle of G(ϕ(y)) and, similarly, exactly one in-neighbour b · y · c such that b c → a belongs to a cycle of G(ϕ(y )). The conclusion of (F) follows.

Fix y ∈ s n . The code ϕ(y) is to be chosen among the family F: specifically, we define the 3-letter codes X 2,n , . . . , X n,n , which generate F by letter permutations. We make sure that for each i ∈ {2, . . . , n}, the code X i,n satisfies the assumptions of (E) and (F) and, in addition, is such that G(X i,n ) is the disjoint union of a hairy cycle of length 2i and of (n − i) hairy cycles of length 2.

We identify n to Z n and perform all arithmetic operations on letters in Z n , unless the letter is explicitly referenced as a member of another cyclic group Z i .

We now deal with the case n ≥ 4, the case where n = 3 being dealt with in the same manner, except with different sets X i,n given later on. We set 

For each j ∈ Z n , we define B j,n := ( j)( j + 1)( j), ( j + 1)( j)( j) , and for each i ∈ {1, . . . , n},

For example, when n = 5, we obtain (2) for each j ∈ Z n , the graph G(B j,n ) is the disjoint union of a hairy cycle of length 2 and an arc starting at j + 1; (3) the graph G(X i,n ) is the union of one hairy cycle of length 2i and n − i hairy cycles of length 2. Each 1-node contained in a (hairy) cycle of length 2 is called a fixed point of X i,n .

Proof (1) The statement holds if i ∈ {1, 2, 3, 4}, as one directly checks on the digraphs depicted in Figs. 8, 9 and 10. We now proceed by induction on i ∈ {4, . . . , n}, the statement being true if i = 4. Let i ∈ {5, . . . , n}. Notice that Y i+1 is obtained from Y i by removing the three words (i − 1)10, 0(i − 1)(i − 1), (i − 2)0(i − 1) and adding the five words (i −1)0i, i(i −1)(i −1), (i −2)i(i −1), i10 and 0ii. It follows that G(Y i+1 ) is obtained from G(Y i ) by replacing the path

along with the attached leaves as presented in Fig. 11 . The conclusion follows. (2) Since B j,n is obtained from Y 1 by applying the permutation x → j + x to every letter, the graph G(B j,n ) is isomorphic to G(Y 1 ), presented in Fig. 8 , with j + 1 being isomorphic to 1. Moreover, if j = j , then G(B j,n ) and G(B j ,n ) have disjoint sets of vertices unless j ∈ { j − 1, j + 1}, in which case the two graphs intersect on exactly one 1-node, which belongs to a cycle in exactly one of them. Therefore, G ∪ j∈Z n B j,n is the union of n hairy cycles, each containing exactly one 1-node.

(3) Observe that if i ∈ {1, . . . , n−1}, then Y i and B j,n are codes over disjoint alphabets whenever j ∈ {i, . . . , n − 2}. Note also that since i ≤ n − 1, the graphs G(Y i ) and G(B n−1,n ) intersect precisely on the 1-node 0, which is a leaf in G(B n−1,n ). The conclusion now follows from (1) and (2), which ends the proof of (G).

Notice that if = 3 (and hence s = 0 and k = n), then the code X ϕ is equal to ϕ(ε), where ε is the empty word; consequently, it suffices to choose ϕ(ε) = X n,n by (G).

We now assume that > 3, and hence s ≥ 1. We are ready to build ϕ, which we do in several steps. We let ϕ i be the function ϕ defined at step i. We start by defining an order on s n as follows. Given y ∈ s n , let max(y) be the largest letter occurring in y, and for a ∈ Z n let |y| a be the number of occurrences of a in y.

Let o(y) := max(y), |y| max(y) and fix a total order ≺ on s n compatible with the lexicographic order with respect to o. For every word w ∈ s+1 n , let C ϕ (w) be the set of all (s + 1)-nodes of the connected component of G(X ϕ ) that contains w. We write C i for C ϕ i 0 s+1 .

We set ϕ 0 (y) := j∈Z n B j,n for each y ∈ s n . At each step, we consider the smallest word y ∈ s n not considered yet according to the order defined above, and we obtain ϕ i+1 from ϕ i by changing ϕ i (y). In addition to the properties mentioned before, we make sure that the following invariant is satisfied.

(H). When we consider a word y ∈ s n , the set C i is composed of all the circular shifts of the words a · y for y ≺ y and a ∈ n .

We thus start with 0 s . Note that C 0 = {0 s+1 }. We set ϕ 1 (0 s ) := X n,n . By (F) and (G), we know that C 1 contains all circular shifts of a ·0 s for a ∈ n . We now consider step i: let y ∈ s n such that each y ≺ y has been considered. Let A := {a ∈ n : a · y ∈ C i }. Notice that if a ∈ A, then all circular shifts of a · y belong to C i thanks to (H). Furthermore, 0 ∈ A. Indeed, let b := max(y) and write y = y 1 ·b· y 2 . Because 0 s ≺ y, we know that b = 0. Consequently y 2 · 0 · y 1 ≺ y. Therefore, (H) implies that all circular shifts of b · y 2 · 0 · y 1 belong to C i , so in particular 0 · y does.

Let m := |A| − 1. We let ϕ i+1 (y) be the code obtained from X n−m,n by permuting the letters such that the set of fixed points is A\{0}. We need to prove that C i+1 is still stable under circular shifts and contains a · y for each a ∈ n .

To see this, first note that for every (s +1)-node w, if the arcs incident to w in G(X ϕ i ) and in G(X ϕ i+1 ) are not the same, then y is a prefix or a suffix of w. It follows that the only arcs incident to a given node in C i that may change at step i are those forming a path of the form a · y → (a + 1) · y · a → y · a for some a ∈ A. If a ∈ A\{0}, then a is a fixed point of ϕ i+1 (y), and hence there still is a path of length 2 from a · y to y · a in G(X ϕ i+1 ). It follows that C i ⊆ C i+1 . Let a ∈ n \A and consider w = y · a, so that w / ∈ C i . By (H), this implies that every s-factor of w is greater than or equal to y. Therefore, writing 0, a 1 , . . . , a n−m−1 the 1-letter words contained in the cycle of length 2(n − m) of G(ϕ i+1 (y)) (in cyclic order), we infer that G(X ϕ i+1 ) contains a directed path going (in order) through the following (s + 1)-nodes: 0 · y, y · a 1 , . . . , a 1 · y, y · a 2 , . . . , a 2 · y, . . . , a n−m−1 · y, y · 0.

Consequently, C i+1 satisfies (H).

We let ϕ be ϕ n s+1 . We observe that for every y ∈ Y , the code ϕ(y) contains no word of the form aab, and hence by (E) the component C i is acyclic for each i ∈ {1, . . . , s}. Moreover, by (F) and (G) the middle component C s+1 of G(X ϕ ) is a union of hairy cycles. By (H), all (s + 1)-nodes belong to C ϕ (0 s+1 ), and hence C s+1 is composed of a unique hairy cycle, of length 2k. This concludes the case where n ≥ 4.

When n = 3, we keep the same procedure and always choose ϕ(y) among one of the following four codes (without any letter permutation). We observe that (E) and (G) are still satisfied by these codes, which ends the proof of Lemma I.2.

Theorem 4.4 is now implied by Lemmas 5.1, I.1 and I.2. We conclude by giving three examples. Example I. 3 (1) If = 5 and n = 3, then s = 1 and the order ≺ over 3 can be chosen to be the natural order over integers. One has ϕ(0) = ϕ(1) = X 3,3 and ϕ(2) = X 2,3 . The obtained code is then {00001, 00101, 10002, 00202, 20102, 10200, 01011, 01111, 11012, 01212, 21112, 11210, 02022, 02222, 22122, 12220 , 12021, 02121}.

(2) If = 7 and n = 3, then s = 2 and the order ≺ over 2 3 can be chosen to be 00 ≺ 01 ≺ 10 ≺ 11 ≺ 02 ≺ 20 ≺ 12 ≺ 21 ≺ 22.

One has ϕ(00) = X 3,3 , ϕ ( 01) = X 3,3 , ϕ ( 10) = X 2,3 , ϕ(11) = X 3,3 , ϕ ( 02) = X 2,3 , ϕ ( 20) = X 1,3 , ϕ(12) = X 2,3 , ϕ ( 21) = X 1,3 , ϕ ( 22) = X 2,3 .

(3) If = 7 and n = 4, then s = 2 and the order ≺ over 2 4 can be chosen to be 00 ≺ 01 ≺ 10 ≺ 11 ≺ 02 ≺ 20 ≺ 12 ≺ 21 ≺ 22 ≺ 03 ≺ 30 ≺ 13 ≺ 31 ≺ 23 ≺ 32 ≺ 33.

Let X 3,4 and X 2,4 be, respectively, obtained from X 3,4 and X 2,4 by permuting 1 and 3 so that their respective sets of fixed points are {1} and {1, 2}. One has ϕ(00) = X 4,4 , ϕ(01) = X 4,4 , ϕ(10) = X 3,4 , ϕ(11) = X 4,4 , ϕ(02) = X 3,4 , ϕ (20) = X 2,4 , ϕ(12) = X 3,4 , ϕ(21) = X 2,4 , ϕ(22) = X 3,4 , ϕ(03) = X 2,4 , ϕ(30) = X 1,4 , ϕ(13) = X 2,4 , ϕ(31) = X 1,4 , ϕ(23) = X 2,4 , ϕ(32) = X 1,4 , ϕ(33) = X 2,4 .

Asp Ile Thr Tyr Glu Ser His Leu Ala Pro Arg Cys Val Code: 14 D 1

Extremal digraphs and comma-free codes

Codes and automata. Encyclopedia of mathematics and its applications

Evolutionary conservation and functional implications of circular code motifs in eukaryotic genomes

Circular code motifs in the ribosome: a missing link in the evolution of translation

Circular code motifs in the ribosome decoding center

Circular code motifs near the ribosome decoding center

Circular codes, symmetries and transformations

Strüngmann L (2016) n-nucleotide circular codes in graph theory

Diletter circular codes over finite alphabets

Strong comma-free codes in genetic information

On the hierarchy of trinucleotide n-circular codes and their corresponding amino acids

Maximal dinucleotide comma-free codes

Mathematical Fundamentals for the noise immunity of the genetic code

The genetic code is one in a million

Comma-free codes

Construction and properties of comma-free codes

Frozen accident pushing 50: stereochemistry, expansion, and chance in the evolution of the genetic code

Origin and evolution of the genetic code: the universal enigma

Completing comma-free codes

A 2006 review of circular codes in genes

Circular code motifs in transfer and 16S ribosomal RNAs: a possible translation code in genes

Circular code motifs in transfer RNAs

A genetic scale of reading frame coding

The maximal C 3 self-complementary trinucleotide circular code X in genes of bacteria, eukaryotes, plasmids and viruses

The maximal C 3 self-complementary trinucleotide circular code X in genes of bacteria, archaea, eukaryotes, plasmids and viruses

Enrichment of circular code motifs in the genes of the yeast Saccharomyces cerevisiae

Identification of all trinucleotide circular codes

A permuted set of a trinucleotide circular code coding the 20 amino acids in variant nuclear codes

Varieties of comma free codes

A relation between trinucleotide comma-free codes and trinucleotide circular codes

Identification of a circular code periodicity in the bacterial ribosome: origin of codon periodicity in genes?

The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides

Stereochemical relationship between coding triplets and amino-acids

A characterization for a set of trinucleotides to be a circular code

Order in the genetic code

A co-evolution theory of the genetic code

The genetic code and RNA-amino acid affinities

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations