key: cord-0428482-jy0j9r5f authors: Zhang, Shu; Pu, Lianrong; Yang, Runmin; Wang, Luli; Zhu, Daming; Jiang, Haitao title: Longest Order Conserved Exemplar Subsequences date: 2020-12-15 journal: bioRxiv DOI: 10.1101/2020.12.15.422841 sha: ef79bc004760a9328e81b22476cc52a4f511b8ab doc_id: 428482 cord_uid: jy0j9r5f We propose a new problem whose input data are two linear genomes together with two indexed gene subsequences of them, which asks to find a longest common exemplar subsequence of the two given genomes with a subsequence identical to the given indexed gene subsequences. We present an algorithm for this problem such that the algorithm is allowed to take diminishing time and space to solve the problem by setting the indexed genes with an incremental number. Although an incremental number of indexed genes were selected, the algorithm was verified definite to reach a solution whose length insistently comes very close to a real longest common exemplar subsequence of the two given genomes. Aiming at 23 human/gorilla chromosome pairs, the algorithm was examined for use in questing for longest common exemplar subsequences whose basic units are annotated genes as well as pseudo genes, namely consecutive DNA subsequences. By contrasting the pseudo gene common exemplar subsequences the algorithm had reached for the human chromosomes 7 and 16 and their gorilla homologues with those annotated genes in the human and gorilla chromosomes, we found more than 1 000 and 500 pseudo genes in the human chromosomes 7 and 16 that occur in the same order as they are in the gorilla chromosomes 7 and 16 and, do not overlap with any annotated gene. Author summary There is a benefit of the algorithm: It can reach a long enough common exemplar subsequence of two linear genomes in as fast a speed as one requires even if the given genomes would be equipped with too many duplicated genes, which can be done by setting incremental number of indexed genes. We developed a Java software based on the algorithm, that has been available for download on https://github.com/ShuZhang-sdu/LCES. Only in need to set the indexed gene sequences as null, was it verified successful for our algorithm to obtain the longest common exemplar subsequences of the annotated gene summary pairs extracted from 23 human/gorilla chromosome pairs. In convenience for researchers to find new motifs or conserved genes, we devoted for the algorithm to quest pseudo gene (i.e. consecutive DNA subsequences) summary pairs of the 23 human/gorilla chromosome pairs for solutions. There are 20 pseudo gene summary pairs whose longest common exemplar subsequences have been found by the algorithm with null indexed gene sequences. The other 3 pseudo gene summary pairs were verified solvable for the algorithm to reach their longest common exemplar subsequences that have to admit subsequences identical to given indexed gene subsequences. There were informed to exist 2 353 and 1 148 pseudo genes in the gorilla chromosome 7 and 16 that occur in the same order as they are in the human chromosome 7 and 16 and, do not overlap with any annotated gene. These pseudo genes should be significant for annotating the human or gorilla genome. 1 Conserved gene based molecular mechanism analysis happens basically in kinds of 2 bioresearch such as breeding soybean seeds [1] as well as developing drugs or vaccines 3 to beat back corona viruses [2] [3] [4] . Finding conserved gene subsequences in 4 genomes has been an attractive topic in biomedicine, bioinformatics and computer 5 science all the time. A consecutive DNA subsequence is not necessarily an annotated 6 gene if it is conserved. Thus a sequence of conserved consecutive DNA subsequences in 7 a genome sounds to have a broader range of applications such as for people to find 8 motifs in it or illuminate bio-functions of it. An order conserved sequence of conserved 9 consecutive DNA subsequences or genes can be found by solving a combinatorial 10 problem in terms of genome comparison [5] [6] [7] . 11 The exemplar breakpoint distance problem (abbr. EBD) proposed by Sankoff is a 12 pioneering problem for this purpose [5] . Some heuristic algorithms using branch and 13 bound [5] or divide and conquer [8] can be found effective for this problem. This 14 problem is NP-Hard [9] and there are extensive approaches on the computational 15 complexity of this problem [10] [11] [12] . Some more algorithmic progresses for special 16 versions of this problem can be found in the literatures [13] [14] . The decision version 17 of EBD, the zero exemplar breakpoint distance problem (abbr. ZEBD), remains 18 NP-Hard [11] and admits an O(n 2 1.86 n ) time algorithm for the two given genomes 19 both to allow at most 2 identical genes where n is the number of genes in the given 20 genomes [15] . Some special versions of ZEBD admit polynomial time 21 algorithms [11] [15] [16] . 22 In the sense of order conserved, a long enough common exemplar subsequence of 23 two or more linear genomes should be better qualified for conserved. The exemplar 24 longest common subsequence problem (abbr. ELCS) was proposed by Bonizzoni et 25 al. [6] to quest for a longest common subsequence of two or more linear genomes that 26 contains a gene of each mandatory gene family. This problem is NP-Hard since ZEBD 27 is a special version of it and, admits a polynomial time algorithm for the special 28 version whose input is of two linear genomes with at most three genes of each 29 mandatory family [6] . This problem admits fixed-parameter algorithms for instances 30 of two linear genomes [6] . 31 The repetition-free longest common subsequence problem (abbr. RFLCS) proposed 32 by Adi et al. [7] is in fact a special version of ELCS. This problem is NP-Hard since B) in exclusion of those identical to genes in X (resp. Y ) admits a span s(A, X) (resp. 66 s(B, Y )), then the time complexity and the space complexity of the algorithm are as Y with increasing number of genes. Using the algorithm for LCES-IG, we developed a software that have been ready 75 for finding a longest common exemplar subsequence of two linear genomes A and B, 76 which admits a subsequence identical to a given subsequence X of A as well as a given 77 subsequence Y of B. Aiming at 23 human/gorilla chromosome pairs, we performed 78 experiments where the algorithm was employed to quest for their longest common 79 exemplar subsequences of annotated genes. By running the algorithm with X and Y 80 as null, we obtained the longest common exemplar subsequences of 23 annotated gene 81 summary pairs extracted from those 23 human/gorilla chromosome pairs. In expectation of avail to find new conserved genes or new motifs in human as well 83 as gorilla genomes, we performed experiments where the algorithm was employed to 84 quest for longest common exemplar subsequences of pseudo genes, i.e., consecutive 85 DNA subsequences that occurs in a human chromosome and its gorilla homologue, 86 aiming at those 23 human/gorilla chromosome pairs. A modified version of 87 SDquest [23] was used to identify the consecutive DNA subsequences in those 23 88 human/gorilla chromosome pairs as pseudo genes so that 23 pairs of pseudo gene 89 sequences each of which will be called a pseudo gene summary were extracted from 90 those 23 human/gorilla chromosome pairs. For X and Y to be set with null, there are 20 pseudo gene summary pairs whose 92 longest common exemplar subsequences have been found by the algorithm. The other 93 3 pseudo gene summary pairs were all verified solvable for the algorithm to reach their 94 longest common exemplar subsequences that have to admit subsequences identical to 95 given indexed gene subsequences X and Y . Positively, one is allowed to select more 96 indexed genes in X as well as Y to make the algorithm take less time and space to 97 November 24, 2020 3/24 reach a solution whose length can always come very close to a real longest common 98 exemplar subsequence of the two given genomes. This implies it constantly true for 99 the algorithm to reach an order conserved sequence of conserved consecutive DNA 100 subsequences in a genome in as short a running time as required in practice. 101 In convenience for researchers to mine something from those longest common [1, 5] ∥ π [6, 10] . There can be more than one gene of the same gene family in a linear genome or 128 gene sequence. Two genes are referred to as identical, if they are of the same gene 129 family. For a gene π[i] in π, we denote by occ(π, π[i]) the number of genes in π that 130 are identical to π[i]. We refer to j − i for i < j as the span of π[i] and π[j] in π. The 131 span of a linear genome refers to the maximum span of two identical genes in that 132 linear genome. Let π = 1 4 3 2 4 1 5 2 6 for example. Then occ(π, 1) = occ(π, 2) = 2, 133 occ(π, 3) = 1 and the span of π is 5, because π [1] and π [6] are identical and their span 134 5 is maximum over all two identical genes in π. 135 We denote it x = y for two genes x and y to be identical. Two gene sequences X = 136 X [1, p] 137 A sequence of genes is referred to as a subsequence of a linear genome, if it can be 138 obtained by deleting some genes (not necessarily consecutive) from the linear genome. 139 A subsequence of a linear genome on Σ is referred to as an exemplar subsequence of 140 the genome, if any two genes in this subsequence are not identical. problem is what we were just asked. Instance: Linear genomes G 1 , G 2 , ..., G N ; exemplar subsequences X 1 , Question: Find a longest common exemplar subsequence of G 1 , ..., G N indexed by 166 X j for 1 ≤ j ≤ N . 167 We are going to mention this problem as LCES-IG for short and abbreviate by an 168 LCES a longest common exemplar subsequence of some linear genomes. In what follows of this section, we present a dynamic programming algorithm for 170 LCES-IG whose instances are of two linear genomes and their respective exemplar 171 subsequences. We represent an LCES-IG instance by two linear genomes A = A [1] ... , if identical to a gene in X (resp. Y ) and 175 other than the gene, cannot occur in any common exemplar subsequence indexed by 176 X (resp. Y ). Thus A (resp. B) is always assumed with no other gene identical to 179 For i and j with 0 ≤ i ≤ m and 0 ≤ j ≤ n, let C(i, j) denote the set of all common 180 exemplar subsequences of A [1, i] and B [1, j] . A member C ∈ C(i, j) is referred to as 188 be indexed by X [1, k] as well as Y [1, k] , then any extension of the member cannot be 189 indexed by X as well as Y . Proof. Let C 1 ∈ C(i, j), C ∈ C(m, n). If C is an extension of C 1 , then there is a It follows from occ(A, A[x p ]) = 1 for 1 ≤ p ≤ k that C 2 cannot be indexed by X[p] 194 for 1 ≤ p ≤ k. If C 1 fails to be indexed by X [1, k] as well as Y [1, k] , so does C. 195 It follows by Lemma 1 that if i ≥ x k and j ≥ y k , anyone in C(i, j) can be given up 196 to maintain for extension if it fails to be indexed by X [1, k] as well as Y [1, k] . This 197 implies that if i ≥ x k+1 and j ≥ y k+1 for 1 ≤ k + 1 ≤ q, then there is no need to 198 maintain for extension any one in C(i, j) that is indexed by X [1, k] as well as Y [1, k] 199 instead of X[1, k + 1] as well as Y [1, k + 1]. On the other hand, if i ≥ x k+1 and j < 200 y k+1 (resp. i < x k+1 and j ≥ y k+1 ) for 1 ≤ k + 1 ≤ q, then it follows by Lemma 1 201 that anyone in C(i, j) can be given up to maintain for extension. Thus only when i 202 and j fall in such intervals as x k ≤ i < x k+1 and y k ≤ j < y k+1 , is it necessary to 203 maintain for extension those common exemplar subsequences in C(i, j) indexed by 204 X [1, k] as well as Y [1, k] , For i and j with x k ≤ i < x k+1 and y k ≤ j < y k+1 (0 ≤ k ≤ q), we focus on how 212 to get CP (i, j). Since A[1, 0] = "" and B[1, 0] = "", we set CP (x 0 , y 0 ) = CP (0, 0) = 213 {""}. The following lemma is in terms of how to get CP (x k , y k ) for k ≥ 1. i ≥ x k (resp. j ≥ y k ) where 1 ≤ k ≤ q, a member in C(i, j) fails to205 Let x 0 = y 0 = 0, x q+1 = m + 1, y q+1 = ny k ≤ j < y k+1 where 0 ≤ k ≤ q, CP (i, j) {C ∈ C(i, j) | C is indexed by X[1, k] as k with 1 ≤ k ≤ q, a common exemplar subsequence C ∈ C(x k , y k ) is 215 indexed by X[1, k] as well as Y [1, k], if and only if there exists C ′ ∈ C(x k − 1, y k − 1]) 216 indexed by X[1, k − 1] as well as Y [1, k − 1] such that C = C ′ ∥ A[x k ]. 217 Proof. If : Let C ′ ∈ C(x k − 1, y k − 1). It follows from occ(A, A[x k ]) = occ(B, B[y k ]) 218 = 1 that no gene identical to A[x k ] or B[y k ] can occur in C ′ . It follows from A[x k ] = 219 B[y k ] that C ′ ∥ A[x k ] ∈ C(x k , y k ) and if C ′ is indexed by X[1, k − 1] as well as Y [1, 220 k − 1], then C ′ ∥ A[x k ] is indexed by X[1, k] as well as Y [1, k]. Only if : Let C ∈ C(x k , y k ) be indexed by X [1, k] as well as Y [1, k] . . It remains to argue the way of how to get CP (i, j) from CP (i, denote an arbitrary common exemplar subsequence of 236 A [1, i] and B [1, j] indexed by X [1, k] as well as Y [1, k] . 248 that can be argued true in the same way as in case (1.1). Since i = x k , no member in 249 CP (i − 1, j) can be indexed by X [1, k] as well as Y [1, k] To sum up, CP (i, j) can be computed recursively by Formula (1) for i and j with i 254 The lemma follows from that the 272 longest extension of C 1 in C(m, n) has no less genes than |C 1 ∥ C 3 |. 273 By Lemma 3, to get an LCES of A and B indexed by X as well as Y , it suffices for 274 the dynamic programming to maintain a subset of CP (i, j) whose common exemplar It follows from CP (0, 0) = {""} that CF P (0, 0) = {""}. Then for i and j with i + 287 j > 0, x k ≤ i < x k+1 and y k ≤ j < y k+1 (0 ≤ k ≤ q), we face with the task of getting 288 a minimum representative subset of CP (i, j) from CF P (i − 1, j), CF P (i, j − 1) and . Then we argue for CF P to be 300 representative in the following two aspects. The reason why CF P is minimum over all representative subsets of subset of CP (i, j) can be extracted by examining every two members in a subset of 336 Formally, the confused gene family set of C can be got as follows. where it is treated as what occurs in C ∈ CP (i, j). Thus to get the confused gene 343 family set of C ∈ CP (i, j), it suffices to decide whether the gene family of . This can be argued in two subcases. In summary of (2.1) and (2.2), f (i, j, C) can be got by the following formula. 353 Let F P (i, j) denote the confused gene family set collection of CF P (i, j). Since To extract F P (i, j), it needs help of the lengths of those members in CP (i, j, x k+1 and y k ≤ j < y k+1 (0 ≤ k ≤ q), the CES length of an arbitrary confused gene can 378 be computed recursively by Formula (5). 379 accompanied with a CES length, F P (i, j) can be arrived at recursively by Formula (6) . 388 This leads to a dynamic programming based algorithm to get F P (m, n). Since all 1: x 0 ← 0; y 0 ← 0; x q+1 ← m + 1; y q+1 ← n + 1; 2: F P (i, 0) = F P (0, j) ← {∅}, 0 ≤ i ≤ m, 0 ≤ j ≤ n; 3: for k from 0 to q do 4: for i from x k to x k+1 − 1 do 5: for j from y k to y k+1 − 1 do 6: Get F P (i, j) by (3) The span of two identical genes in a linear genome has usually been used as a 399 parameter to design efficient algorithms [11] [14] . There have been approaches on 400 comparing human and mouse genomes which provide hints for identical genes in a 401 genome to occur in limited spans [24] . Since so, let s(A, X) (resp. s(B,Y )) denote the 402 span of A (resp. B) in exclusion of those genes each of which is identical to someone in 403 X (resp. Let for N ≥ 2, G 1 , G 2 , ..., G N be N linear genomes on Σ whose lengths are n 1 , n 2 , ..., 431 n N respectively. Let i 2 , ..., i N ) . To get F P (i 1 , 446 i 2 , ..., i N ), it has to access 2 N − 1 confused gene family set collections. There are at ≤ i ≤ n 1 . Then a member in CP (i 1 , i 2 , ..., i N ) admits a confused gene family set of at 449 most s gene families and there are at most 2 s members in F P (i 1 , i 2 , ..., i N ). In order 450 to go from F P (0, 0, ..., 0) to F P (n 1 , n 2 , ..., n k ), there are Σ q+1 i=1 ( confused gene family set collections to be involved into computation, where x j 0 = 0 452 and x j q+1 = n j + 1 for 1 ≤ j ≤ N . Table 1 . chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 human 5475 4200 3188 2657 2988 3064 3014 2485 2333 2336 3364 3055 gorilla 2947 2022 1716 1223 2094 1542 1440 1107 1165 1127 1728 1506 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX human 1402 2287 2219 2558 3059 1244 2991 1458 875 1388 2425 gorilla 611 1102 991 1144 820 474 1668 787 307 645 1315 A "huamn" or "gorilla" statistic represents the gene number of a human or gorilla chromosome summary. Those 23 human/gorilla chromosome summary pairs were all verified solvable for 499 LCES(A, B, X, Y ) with X = Y = "" to reach their LCESs. These LCESs have been 500 prepared ready in S1 File. Since the length of an LCES of two chromosome summaries 501 reflects the structure similarity of them, we present in Table 2 the first row the lengths 502 of those 23 LCESs. Moreover, we present in Table 2 A "length" statistic represents the gene number of an LCES. An "lces/lh" (resp. "lces/lg") statistic represents the ratio of the gen number of an LCES to the gene number of a human (resp. gorilla) chromosome summary. A "time(ms)" statistic represents the real running time. There exist 1 658 distinct named genes in the LCES for the algorithm to reach in 507 questing the summaries of human chromosome 1 and its gorilla homologue. It took To break through the restriction of genes that have been annotated, we turn attention 515 to find order conserved sequences of significant subsequences in human or gorilla 516 chromosomes without any regard to the annotation files. We refer to a consecutive 517 DNA subsequence as a pseudo gene, if it has at least a stated number (500) of DNA 518 bases and occurs in a human chromosome as well as its gorilla homologue. We start 519 with a DNA sequence represented human chromosome and its gorilla homologue to 520 pursue one of their pseudo gene sequence represented LCESs. A modified version of SDquest [23] was employed to extract two respective pseudo 522 gene subsequences from a human chromosome and its gorilla homologue. For SDquest 523 November 24, 2020 13/24 with input to be a human chromosome and its gorilla homologue, it will output two 524 pseudo gene sequence represented subsequences of the human chromosome and its 525 gorilla homologue respectively. SDquest was set to identify a consecutive DNA 526 subsequence in a human (resp. gorilla) chromosome as a pseudo gene if it has at least 527 500 DNA bases and occurs in the chromosome's gorilla (resp. human) homologue with 528 at least 95% identity 1 [23] . All pseudo genes for SDquest to identify were encoded 529 with integers and two pseudo genes were treated as identical and encoded with the 530 same integer, if they are with at least 95% identity. The "hg38" assembly [28] and the "gorGor4" assembly [29] A "Human" or "Gorilla" statistic represents the length of a human or gorilla pseudo gene summary. whose LCES lengths as well as the running time the algorithm took to reach them, are 547 given in Table 4 . We were informed out of memory by LCES(A, B, "", "") for it to 548 quest the other three summary pairs. 549 A "length" statistic represents the length of an LCES. A "time(ms)" statistic represents the real time for the algorithm to get an LCES. 1 95% identity means matches matches+mismatches+indels = 95%, where matches (resp. mismatches) is the number of matched (resp. mismatched) nucleotides and indels is the number of indels. What we did for this uncertainty is: For r to take value in (0, 1) increasingly, 576 randomly select a subsequence C ′ of C with r|C| pseudo genes and examine where each time of the algorithm's execution in a round was with 591 C ′ randomly selected from C and |C ′ | = r|C|. 592 In Table 5 , we present the respective successful numbers in all rounds of algorithm 593 executions. Those statistics in Table 5 algorithm execution was with C ′ randomly selected from C and |C ′ | = r|C|. 615 We present in Table 6 pseudo gene summary pairs. We present in Table 6 Moreover, we present in Table 6 The statistics in Table 6 show that although there are an increasing number of indexed 627 genes to be selected, there exists no downward trend on the lengths of solutions for the 628 algorithm to reach. The lengths of solutions the algorithm had reached are always 629 very close to the length of a real LCES. This implies that one is allowed to select more 630 indexed genes to make the algorithm take less time to reach a common exemplar 631 subsequence whose length can come very close to an LCES. As r takes value from 30% to 50%, the average running time for Index-LCES(A, B, 633 solutions. This might be because there are 12 identical pseudo gene pairs in both of 638 these two pseudo gene summaries whose spans are no less than 4730. For the pseudo gene summary pairs of the human chromosomes 7 and 16 and their 640 gorilla homologues and for r ∈ {10%, 15%, 20%, 25%, 30%}, we present in Table 7 Table 7 show that the running time the algorithm 646 had taken averagely to reach a solution rapidly decreases as the value of r increases, 647 instead of which, the average length of those common exemplar subsequences for the 648 algorithm to reach remains unchanged. genes. In S6 File (resp. S7 File), we have prepared ready the annotated genes in those 677 23 human (resp. 24 gorilla) chromosomes that overlap with some pseudo genes in 678 sequences in S4 File. A pseudo gene in a sequence in S4 File, if does not overlap with any annotated gene, 680 is valuable for people to illuminate its bio-function or significant for people to mine 681 motifs in it. So In S8 File (resp. S9 File), we have prepared ready those pseudo genes 682 that do not overlap with any annotated gene in human (resp. gorilla) chromosomes. One may be more interested in a pseudo gene, if it both does not overlap with any 684 annotated gene and is long enough. So in S10 File (resp. S11 File), we have prepared 685 ready those pseudo genes that do not overlap with any annotated gene in human as 686 well as gorilla chromosomes and each of them has at least 10 000 bases. concern because they were reached by Index-LCES(A, B, X, Y ) with some pseudo 691 genes fixed as indexed. Since the gorilla chromosome 2 we mentioned is a fusion of two 692 real gorilla chromosomes, we will give up to show more statistics about CES-2 in 693 contrast with the annotated gene sequences that were extracted from the human 694 chromosome 2 and its gorilla homologue. In Table 8 , we disclose more statistic 695 information about CES-7 and CES-16. In Table 8 the column with title AG (resp. PG), we present the numbers of 697 annotated (resp. pseudo) genes that were extracted from the human chromosomes 7 To identify new genes or motifs, the pseudo genes that do not overlap with any 710 annotated gene should be more attractive for people to pay attention to. In Table 8 711 the column with title PG∧AG, we present the numbers of pseudo genes in CES-7 and 712 CES-16 that do not overlap with any annotated gene extracted from the human 713 chromosome 7 and 16. We present in the column with title PG∧AG(10000) 714 additionally, the numbers of pseudo genes in CES-7 and CES-16 that were involved in 715 the column with title PG∧AG and each of them has at least 10 000 bases. In Table 8 716 the last column, we present the site intervals of the longest pseudo genes that do not 717 overlap with any annotated gene in CES-7 and CES-16 respectively. 718 We are informed by these statistics an order conserved sequence of more than 400 719 (resp. 1 000) pseudo genes all of which are of more than 10 000 bases and do not 720 overlap with any annotated gene in the human (resp. gorilla) chromosome 7. The same 721 two numbers for the human and gorilla chromosome 16 are 140 and 400 respectively. "humanpos" (resp. "gorillapos") indicates the identification number in A (resp. B) of 800 the pseudo gene that is in the same row as it lies in and the column with title 801 "indexedgene". Let X (resp. Y ) denote the subsequence of pseudo genes in A (resp. 802 B) whose identification numbers are given in the column with title "humanpos" (resp. 803 "gorillapos"). Then the LCES of A and B indexed by X as well as Y will be reached Encoding an APC8-like Protein, Controls Leaf Petiole Angle in Soybean Safety, tolerability, and immunogenicity of a recombinant adenovirus type-5 vectored COVID-19 vaccine: a dose-escalation, open-label, non-randomised, first-in-human trial Therapeutic options for the 2019 novel coronavirus (2019-nCoV) Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic Genome rearrangement with gene families Exemplar longest common subsequence Repetition-free longest common subsequence Divide-and-conquer approach for the exemplar breakpoint distance The Complexity of Calculating Exemplar Distances On the Approximability of Comparing Genomes with Duplicates The Exemplar BreakpointDistance for Non-trivial Genomes Cannot Be Approximated Approximability and Fixed-Parameter Tractability for the Exemplar Genomic Distance Problems A polynomial algebra method for computing exemplar breakpoint distance A Dynamic Programming Algorithm For (1,2)-Exemplar Breakpoint Distance An Exact Algorithm for the Zero Exemplar Breakpoint Distance Problem The Zero Exemplar Distance Problem The Longest Common Exemplar Subsequence Problem A branch-and-cut approach to the repetition-free longest common subsequence problem On the parameterized complexity of the repetition free longest common subsequence problem Construct, merge, solve and adapt: application to the repetition-free longest common subsequence problem Beam-ACO for the repetition-free longest common subsequence problem A hybrid genetic algorithm for the repetition free longest common subsequence problem Detection and analysis of ancient segmental duplications in mammalian genomes Genome rearrangements in mammalian evolution: lessons from human and mouse genomes DNA sequencing technologies key to the Human Genome Project The origin of man: a chromosomal pictorial legacy Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly Insights into hominid evolution from the gorilla genome sequence The tiger genome and comparative analysis with lion and snow leopard genomes 3014 1552 4241 1150 403 41458703-41609288 Gorilla 1440 949 4241 2353 1025 41482034-41811089 chr16 Human 2558 1420 2450 528 146 59234107-59359477 Gorilla 1144 764 2450 1148 404 54781291-55051203 A statistic with column title "AG" indicates the number of annotated genes in the human or gorilla chromosomes 7 and 16. A statistic with column title "AG" "AG∧PG" indicates the number of annotated genes that overlap with some pseudo genes in CES-7 or CES-16. A statistic with column title "PG" indicates the number of pseudo genes in CES-7 or CES-16. A statistic with column title "PG∧AG" indicates the number of pseudo genes in CES-7 or CES-16 that do not overlap with any annotated gene. A statistic with column title "PG∧AG(10000)" indicates the number of pseudo genes in CES-7 or CES-16 that do not overlap with any annotated gene and have at least 10000 bases. of the human chromosome i and its gorilla homologue is concealed in the columns with 735 titles "Human" and "signH" (resp. "Gorilla" and "signG"). One can verify that the 736 subsequence of annotated genes that occur in the column with title "Human" 737 accompanied with statistics of " * " in the column with title "signH" is identical to the 738 subsequence of genes that occur in the column with title "Gorilla" accompanied with 739 statistics of " * " in the column with title "signG". Any of these two annotated gene