title: Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval
authors: Kuruppu, Shanika; Puglisi, Simon J.; Zobel, Justin
date: 2010
journal: String Processing and Information Retrieval
DOI: 10.1007/978-3-642-16321-0_20

Self-indexes – data structures that simultaneously provide fast search of and access to compressed text – are promising for genomic data, but in their usual form they are not able to exploit the high level of replication present in a collection of related genomes. Our 'RLZ' approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of $r$ sequences totaling $N$ bases, with a total of $s$ point mutations from a base sequence of length $n$, this representation requires just $nH_k(T_1) + s\log n + s\log\frac{N}{s} + O(s)$ bits. At the cost of negligible extra space, access to $\ell$ consecutive symbols requires $O(\ell + \log^{1+\epsilon} n)$ time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory.

The emergence of high-throughput sequencing technologies, capable of sequencing entire genomes in a single run, has led to a dramatic change in the number and type of sequencing projects being undertaken. In particular, it is now feasible to acquire and study variations between many individual genomes of organisms from the same species. While the total size of these sets of genomes will be large, the individual genomes will not greatly vary, so these collections present new challenges for compression and indexing.

In this paper we address the following problem. Given a collection $C$ of $r$ sequences $T_k \in C$ such that $|T_k| = n$ for $1 \le k \le r$, each a copy of the base sequence $T_1$ containing overall $s$ point mutations, the repetitive collection indexing problem is to store $C$ efficiently while allowing queries of the form display(i, j, k) to efficiently return the substring $T_k[i..j]$. We describe a solution to this problem that requires $nH_k(T_1) + s\log n + s\log\frac{N}{s} + O(s)$ bits of space and $O(\ell + \log^{1+\epsilon} n)$ time to return a requested substring of length $\ell$, for any constant $\epsilon > 0$, assuming a constant alphabet size.

Our approach is to use the base sequence as a dictionary for compression of the other sequences. The collection is parsed into factors in an LZ77 [11] manner, but references to substituted strings are restricted to lie in the base sequence only. Arbitrary substrings can then be decoded by reference to the base sequence. This work is inspired by the recent work of Mäkinen et al. [7] who, to our knowledge, were the first to tackle the above problem.
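To make the parsing concrete, the following is a minimal Python sketch of relative LZ factorization. The function name rlz_parse and the quadratic longest-match search are our own illustration, not the authors' implementation; the paper's actual construction streams each sequence against a suffix array of the base, as described later.

```python
def rlz_parse(target, base):
    """Greedily factor `target` into (position, length) references to `base`.

    At each step, take the longest prefix of the remaining suffix of
    `target` that occurs somewhere in `base`. We assume every symbol of
    `target` occurs in `base` (the paper appends any missing symbols to
    the base), so every factor has length at least 1.
    """
    factors = []
    i = 0
    while i < len(target):
        best_pos, best_len = 0, 0
        # Naive longest-match search, for clarity only.
        for j in range(len(base)):
            length = 0
            while (j + length < len(base) and i + length < len(target)
                   and base[j + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = j, length
        factors.append((best_pos, best_len))
        i += best_len
    return factors

# A sequence differing from the base by one point mutation parses
# into just three factors.
base = "ACGTACGTTGCA"
print(rlz_parse("ACGTACCTTGCA", base))  # [(0, 6), (1, 1), (7, 5)]
```

As the example suggests, the number of factors grows with the number of mutations, not with the length of the sequence, which is what makes the representation so compact for collections of near-identical genomes.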
BioCompress [5] was the first compression algorithm designed specifically for DNA sequences, using simple techniques such as encoding nucleotides with 2 bits per base and detecting reverse-complement repeats. Two further variations on the BioCompress theme, Cfact [10] and Off-line [1], both work in rounds, greedily replacing duplicate text with shorter codes. GenCompress [3] showed that the results could be improved by considering approximate repeats. Since GenCompress, most DNA compression algorithms have been based on efficient methods of approximate repeat detection. Many of these early DNA compression algorithms are only able to compress small files.

Due to the latest advances in DNA sequencing technologies, much larger DNA sequencing projects are underway. As a result, many individual genomes from the same species are being sequenced, creating a large amount of redundancy. Recent algorithms have specifically addressed the issue of compressing DNA data from the same species. Christley et al. [4] compress variation data from human genomes, encoding the mutations and indels with respect to the human reference sequence and known variations recorded in a SNP database. However, large resources must be shared among users of the compressed data (4.2 GB of reference and SNPs in Christley et al.'s software). Furthermore, their method does not support random access into the sequences.

Before explaining our method we introduce notation and review several pieces of algorithmic machinery on which our results rely.

Relative LZ parsing. The parsing LZ(T_i|T_1) factors $T_i$ into a sequence of factors, each of which is either (a) a single symbol that does not occur in $T_1$, or (b) a (position, length) reference to a substring of $T_1$. For convenience, we assume no factors are generated by rule (a); that is, if $c$ occurs in $T_i$ for $i \ge 2$ then $c$ also occurs in $T_1$. If $T_1$ is not so composed we can simply add the at most $\sigma - 1$ missing symbols to the end of it.

Compressed Integer Sets. We make use of a compressed set representation due to Okanohara and Sadakane [9] (called "sdarray" in their paper). Given a set $S$ of $m$ integers over a universe $u$, this data structure supports the operations $\mathrm{rank}(S, i)$, returning the number of elements of $S$ that are at most $i$, and $\mathrm{select}(S, j)$, returning the $j$th smallest element of $S$. It occupies $m\log\frac{u}{m} + O(m)$ bits, supporting select in $O(1)$ time and rank in $O(\log\frac{u}{m})$ time.

We now describe how to store a collection of related sequences while still allowing efficient access to arbitrary substrings of any of the constituent sequences. The structure occupies $n\log\sigma + z\log n + z\log\frac{N}{z} + O(z)$ bits, where $z$ is the total number of factors over all parsings, and returns any substring $T_i[s..e]$ in $O(e - s + \log\frac{N}{z})$ time.

Proof. $T_1$, the base sequence, is stored in $n\log\sigma$ bits (in the obvious way). Let $Z_i = \mathrm{LZ}(T_i|T_1) = (p_1, \ell_1), (p_2, \ell_2), \ldots, (p_{z_i}, \ell_{z_i})$ for $i > 1$ be the parsing of text $T_i$ relative to $T_1$. We store $Z_i$ in two pieces. The position components of the factors, $p_1, \ldots, p_{z_i}$, are stored in a table $P_i[1..z_i]$ taking $z_i\log n$ bits. $P_i$ is simply a concatenation of the bits representing $p_1, \ldots, p_{z_i}$. Each entry in $P_i$ is $\log n$ bits long, allowing access to any entry $p_j = P_i[j]$ in $O(1)$ time. The length components are stored in a compressed integer set, $L_i$, containing the values $1 + \sum_{k=1}^{u-1} \ell_k$ for all $u \in 1..z_i$. In other words, the values in $L_i$ are the starting positions of factors in $T_i$. Via select queries on $L_i$ we obtain the factor lengths: the length of the $j$th factor, $\ell_j$, is simply $\mathrm{select}(L_i, j+1) - \mathrm{select}(L_i, j)$ for $j \in 1..z_i - 1$ (because of the terminating sentinel we always have $\ell_{z_i} = 1$). Furthermore, the factor in which $T_i[j]$ falls is given by $\mathrm{rank}(L_i, j)$. $L_i$ is stored in the "sdarray" representation [9], which requires $z\log\frac{N}{z} + O(z)$ bits overall and allows rank in $O(\log\frac{N}{z})$ time and select in $O(1)$ time.

As described, $P_i$ and $L_i$ allow, given $j$, fast access to $p_j$ and $\ell_j$: a select query on $L_i$ yields $\ell_j$ and a lookup on $P_i$ retrieves $p_j$. Given a position $k$ in sequence $T_i$, we can determine the factor in which $k$ lies by issuing a rank query on $L_i$, in particular $\mathrm{rank}(L_i, k)$. To find a series of consecutive factors we only need to use rank in obtaining the first factor. For the others, the $\ell$ values can be retrieved using repeated select queries on $L_i$, and the $p$ values by accessing consecutive fields in $P_i$, both in constant time per factor. This observation allows us to extract any substring $T_i[s..e]$ in $O(e - s + \log\frac{N}{z})$ time overall.
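The sketch below, again our own illustration, shows how $P_i$ and $L_i$ support extraction. For brevity, a plain Python list of cumulative factor end positions stands in for the compressed sdarray (a minor variant of the paper's boundary set that keeps the indexing arithmetic short): list indexing plays the role of select and bisect plays the role of rank; the real structure achieves the same operations in compressed space.

```python
import bisect
from itertools import accumulate

def build_access(factors):
    """Split a parsing into its P (positions) and L (boundaries) parts.
    L holds the cumulative end position of each factor in the sequence."""
    P = [p for p, _ in factors]
    L = list(accumulate(l for _, l in factors))
    return P, L

def display(base, P, L, s, e):
    """Return T[s..e] (0-based, inclusive) by decoding only the factors
    overlapping the query: one rank to find the first factor, then
    consecutive accesses to P and L, constant time per factor."""
    out = []
    j = bisect.bisect_left(L, s + 1)   # "rank": factor containing position s
    pos = L[j - 1] if j > 0 else 0     # "select": where that factor starts in T
    while pos <= e:
        p, end = P[j], L[j]
        lo = p + (s - pos if pos < s else 0)
        hi = p + min(end, e + 1) - pos
        out.append(base[lo:hi])
        pos, j = end, j + 1
    return "".join(out)

base = "ACGTACGTTGCA"
P, L = build_access([(0, 6), (1, 1), (7, 5)])  # parsing of "ACGTACCTTGCA"
print(display(base, P, L, 4, 9))               # -> "ACCTTG"
```

Note that the query never touches factors outside the requested range, which is what gives the $O(e - s)$ term its small constant in practice.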
We can reduce the $n\log\sigma$ term in the size of the above data structure to $nH_k$ bits by storing $T_1$ as a self-index instead of in plain form. This increases the cost of accessing a substring of length $\ell$ to $O(\ell\log^{1+\epsilon} n + \log\frac{N}{z})$ worst-case time.

It is possible to build our data structure in $O(n + N\log n)$ time and $n\log\sigma + n\log n$ bits of extra space. The basic idea is to build the suffix array for $T_1$ and "stream" every other sequence against it to generate the LZ(T_i|T_1) parsings.
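The following is a sketch of this streaming construction, under the same assumptions as before: the suffix array is built by naive sorting where a linear-time algorithm would be used in practice, and the interval narrowing scans linearly where a real implementation would binary search, but the streaming logic is the same. Each factor is found by successively narrowing the range of base suffixes that match the factor read so far.

```python
def suffix_array(s):
    # Placeholder construction by sorting suffixes; a linear-time
    # suffix array algorithm would be used in practice.
    return sorted(range(len(s)), key=lambda i: s[i:])

def rlz_parse_sa(target, base):
    """Stream `target` against the suffix array of `base`, emitting
    relative LZ factors as (position, length) pairs."""
    sa = suffix_array(base)
    factors, i = [], 0
    while i < len(target):
        lo, hi, length = 0, len(sa), 0
        while i + length < len(target):
            c = target[i + length]
            # Suffixes in sa[lo:hi] share their first `length` symbols, so
            # those whose next symbol equals c form a contiguous subrange.
            match = [k for k in range(lo, hi)
                     if sa[k] + length < len(base)
                     and base[sa[k] + length] == c]
            if not match:
                break
            lo, hi, length = match[0], match[-1] + 1, length + 1
        assert length > 0, "every symbol of target must occur in base"
        factors.append((sa[lo], length))
        i += length
    return factors

base = "ACGTACGTTGCA"
print(rlz_parse_sa("ACGTACCTTGCA", base))  # [(0, 6), (10, 1), (7, 5)]
```

The positions chosen may differ from those of a naive left-to-right match search (any occurrence in the base is equally valid), but the factor lengths, and hence the compression, are the same.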
In summary, for a collection $C$ of $r$ sequences $T_k \in C$ such that $|T_k| = n$ for $1 \le k \le r$ and $\sum_{k=1}^{r} |T_k| = r \cdot n = N$, with $\mathrm{LZ}(T_i|T_1) = (p_1, \ell_1), (p_2, \ell_2), \ldots, (p_{z_i}, \ell_{z_i})$ for $2 \le i \le r$ and $z = \sum_{i=2}^{r} z_i$, the structure occupies $nH_k(T_1) + z\log n + z\log\frac{N}{z} + O(z)$ bits.

The compression performance of our algorithm, which we call RLZ, is compared to the algorithms XM [2], currently the best single-sequence DNA compression algorithm; Comrad [6], which specialises in compression of large related DNA datasets; and RLCSA, an implementation of a self-index from Mäkinen et al. [7]. The RLZ display() function is also compared to RLCSA [7].

Table 1 compares RLZ with RLCSA, Comrad and XM by compressing datasets containing many real biological sequences that vary slightly from each other. The datasets are S. coronavirus with 141 sequences, S. cerevisiae with 39 genomes, S. paradoxus with 36 genomes, and H. sapiens with 4 genomes. The compression performance of RLZ is varied (Table 2). Unsurprisingly, for the smaller datasets, XM has the best compression results. For the larger datasets, RLZ produces better compression results than Comrad while using less memory (≈45 MByte for the yeast sets). RLCSA does not compress as well as the other algorithms, but it uses less memory (≈100 MByte for yeast) and is faster than Comrad and XM. Overall, RLZ compresses and decompresses much faster, and uses drastically less memory, than RLCSA, Comrad and XM.

For the H. sapiens dataset, each chromosome of each genome was compressed against the corresponding chromosome of the reference genome for RLZ and RLCSA. We do not report XM results since it took nearly 6 hours to compress a single chromosome 1 sequence. RLZ performed very well on this dataset compared to RLCSA and Comrad (RLZ used ≈1 GByte of memory while RLCSA and Comrad used ≈2 and ≈16 GByte respectively). With the inclusion of the 7-zipped human reference genome, the total dataset of three human genome sequences (summing to 12 Gbase) can be represented in just under 755 MByte with RLZ; of this, 643 MByte is the overhead for the base sequence. We anticipate that roughly 27,000 human genomes could be stored in a terabyte, a massive increase on the 1500 or so that could be stored using methods such as 7-zip.

Results for the display(i, s, e) (retrieve substring T_i[s..e]) function are in Table 3. RLZ displays substrings significantly faster than RLCSA for small query lengths and maintains its good performance for larger query lengths, at approximately 0.022 μsec per character for RLZ compared to 0.77 μsec per character for RLCSA. Tests were conducted on a 2.6 GHz Dual-Core AMD Opteron CPU with 32 GB of RAM and 512 KB cache, running Ubuntu 8.04. The compiler was gcc v4.2.4 with the -O9 option.

A key issue is the choice of reference sequence. While selecting the reference genome is a simple matter for the yeast datasets, it may not be a good representation of the other individual sequences. To observe the compression performance for different choices of reference sequence, we compressed the dataset by selecting each genome in turn as the reference. Selecting a genome such as DBVPG6765 leads to a total of 16.5 MByte, as opposed to selecting UWOPS05_227_2, which leads to 24.5 MByte.

We also explored the effect on compression when a reference sequence unrelated to the original dataset is selected. The 35 genomes of S. paradoxus (excluding the REF genome) were compressed against the S. cerevisiae REF genome. The compressed size was 112.09 MByte, compared to 2.04 MByte when the S. paradoxus REF genome was used. RLZ is at its most effective when compressing related sequences, and care must be taken when selecting a reference sequence.

References
[1] Compression of biological sequences by greedy off-line textual substitution
[2] A simple statistical algorithm for biological sequence compression
[3] A compression algorithm for DNA sequences and its applications in genome comparison
[4] Human genomes as email attachments
[5] Compression of DNA sequences
[6] Repetition-based compression of large DNA datasets
[7] Storage and retrieval of highly repetitive sequence collections
[8] Compressed full text indexes
[9] Practical entropy-compressed rank/select dictionary
[10] A guaranteed compression scheme for repetitive DNA sequences
[11] A universal algorithm for sequential data compression