key: cord-0599955-xxz07t3z
authors: Liiv, Innar
title: SARS-CoV-2 Coronavirus Data Compression Benchmark
date: 2020-12-21
journal: nan
DOI: nan
sha: 275bcb492a002b73449665a9f91fa9bd3357ea4f
doc_id: 599955
cord_uid: xxz07t3z

This paper introduces a lossless data compression competition that benchmarks solutions (computer programs) by the compressed size of the 44,981 concatenated SARS-CoV-2 sequences, with a total uncompressed size of 1,339,868,341 bytes. The data, downloaded on 13 December 2020 from the severe acute respiratory syndrome coronavirus 2 data hub of ncbi.nlm.nih.gov, is presented in FASTA and 2Bit format. The aim of this competition is to encourage multidisciplinary research to find the shortest lossless description for the sequences and to demonstrate that data compression can serve as an objective and repeatable measure to align scientific breakthroughs across disciplines. The shortest description of the data is the best model; therefore, further reducing the size of this description requires a fundamental understanding of the underlying context and data. This paper presents preliminary results with multiple well-known compression algorithms for baseline measurements, and insights regarding promising research avenues. The competition's progress will be reported at https://coronavirus.innar.com, and the benchmark is open for all to participate and contribute.

Marvin Minsky considered Kolmogorov, Chaitin, and Solomonoff's algorithmic information theory "the most important discovery since Gödel" and conjectured that "practical approximations to [their theory] ... would make better predictions than anything we have today" [21].

This competition is intended to encourage multidisciplinary research, in the spirit of Kolmogorov, Chaitin, and Solomonoff's theory, to develop the shortest lossless description for the sequences of SARS-CoV-2. A successful result will serve as a demonstration that data compression can offer an objective and repeatable measure to align scientific breakthroughs across disciplines. The shortest description of a dataset is the best model. Further compression of the sequences of SARS-CoV-2 will require a fundamental understanding of the data and its context.

The main theoretical underpinnings of this benchmark are the practical approximations to Kolmogorov-Chaitin-Solomonoff complexity by Li and Vitanyi [16] and the minimum description length principle [24, 9].

Kolmogorov complexity is the length of the shortest effective description of an object [14]. Therefore, the idealistic goal of the SARS-CoV-2 coronavirus data compression benchmark is to find the Kolmogorov complexity of SARS-CoV-2 [29] sequences. Since doing so requires an infinite amount of work, the size of the smallest archive plus its decompressor is used as a practical, computable proxy. Matt Mahoney has written an inspiring and excellent rationale for a large text compression benchmark [19] with an extended discussion about the connections between intelligence and compression.
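As a rough illustration of this proxy, the following minimal sketch takes the smallest output among a few general-purpose compressors from the Python standard library, plus an optional decompressor size, as an upper bound on description length. The function names, the choice of compressors, and the toy input are assumptions made for this example and are not part of the benchmark.

```python
import bz2
import lzma
import zlib


def compressed_sizes(data):
    """Return the compressed size in bytes for a few standard compressors."""
    return {
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data, preset=9)),
    }


def description_length_upper_bound(data, decompressor_size=0):
    """Smallest archive plus decompressor size: a computable upper bound,
    not the (uncomputable) Kolmogorov complexity itself."""
    sizes = compressed_sizes(data)
    best = min(sizes, key=sizes.get)
    return best, sizes[best] + decompressor_size


# Toy, highly repetitive nucleotide string (hypothetical input, not the benchmark file).
toy = b"ATTAAAGGTTTATACCTTCCCAGGTAACAAACC" * 200
best, bound = description_length_upper_bound(toy)
print(f"best compressor: {best}, upper bound: {bound} of {len(toy)} bytes")
```

Any compressor that produces a smaller total tightens the bound, which is why the benchmark ranks entries by archive size plus decompressor size.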

Several lossless compression benchmarks have been proposed over the years [3, 11], the best known being that of Marcus Hutter, who offered €500,000 as a challenge prize [11]. The compression of genetic sequences, as a specific niche of data compression research, has been a popular topic for more than 25 years [7, 8]. An interested reader is referred to two comprehensive surveys about data compression methods for biological sequences [2, 10]. Kryukov et al. have recently presented a comprehensive evaluation of reference-free compressors for FASTA-formatted sequences [15] and developed a sequence compression benchmark database.

De Maio et al. have identified several oddities specific to SARS-CoV-2 sequencing data, which may "arise from specific combinations of sample preparation, sequencing technology, and consensus calling approaches" [6]. Such aspects, and other systematic errors typical of sequence data [20], can support the design of a specific compression strategy.

Losslessly compress the 1.25 GB file coronavirus.fasta [17] or its equivalent 2bit representation coronavirus.2bit (0.31 GB) [17] to less than 1,238,330 bytes (the current smallest compressed size of the dataset, including the decompressor).

The data is presented in FASTA and 2Bit (UCSC-twobit [12]) format, consisting of 44,981 concatenated SARS-CoV-2 sequences with a total uncompressed size of 1,339,868,341 bytes [17]. It was downloaded on 13 December 2020 from the severe acute respiratory syndrome coronavirus 2 data hub of ncbi.nlm.nih.gov [23]. Each participant can choose which file to use; that is, the compressor does not have to work on both datasets.
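To show why the 2Bit file is roughly a quarter of the size of the FASTA file, the sketch below packs four nucleotides into each byte. It assumes the base-to-bits mapping commonly documented for the UCSC format (T=00, C=01, A=10, G=11, first base in the most significant bits) and deliberately omits the file header, the sequence index, and the N and mask blocks of the real format; `pack_2bit` and `unpack_2bit` are hypothetical helper names.

```python
# Assumed mapping (T=00, C=01, A=10, G=11); headers, N blocks and
# mask blocks of the real .2bit container are not modelled here.
CODE = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"


def pack_2bit(seq):
    """Pack a string over {A, C, G, T} into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so the first base stays in the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(packed, n_bases):
    """Inverse of pack_2bit; n_bases removes the padding of the last byte."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n_bases])


seq = "ATTAAAGGTTTATACCTTCCCAGG"
packed = pack_2bit(seq)
print(len(seq), "bases ->", len(packed), "bytes")
```

Because the sequence length must be stored separately to undo the padding, a real container needs some metadata in addition to the packed bases.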

The challenge is to compress 44,981 concatenated SARS-CoV-2 sequences. To provide a slightly simpler example that is more amenable to manual inspection, the compression results for one sequence (reference sequence NC_045512 [29]) are presented in Table 1. Table 2 presents the current compression results for the 44,981 SARS-CoV-2 sequences (with a total uncompressed size of 1,339,868,341 bytes), sorted by the number of bytes (with fewer bytes meaning better compressibility), acting as the baseline measurement for the challenge. The bytes column in Table 2 does not include the size of the decompressor, which will be considered in the final benchmark. When the total size of the compressed archive and the decompressor is considered (instead of the compressed archive alone), the PAQ8L compressor [18] by Matt Mahoney performs best, with the best results achieved using the 2Bit format of the dataset. The resulting compressed archive for PAQ8L, including the compressed decompression executable, has a total size of 1,238,330 (1,207,839+30,491) bytes. The CMIX compressor [13] by Byron Knoll produced a smaller compressed archive (988,958 bytes), but its total size, including the compressed decompressor, is 1,282,852 (988,958+293,894) bytes.
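The ranking rule described here (archive size plus compressed decompressor size) can be checked with a few lines of arithmetic. The byte counts below are the ones reported in the text; the variable and function names are illustrative only.

```python
UNCOMPRESSED_BYTES = 1_339_868_341

# (compressed archive bytes, compressed decompressor bytes), as reported in the text.
ENTRIES = {
    "PAQ8L": (1_207_839, 30_491),
    "CMIX": (988_958, 293_894),
}


def total_size(archive_bytes, decompressor_bytes):
    """Benchmark score: archive plus decompressor, in bytes (smaller is better)."""
    return archive_bytes + decompressor_bytes


totals = {name: total_size(*sizes) for name, sizes in ENTRIES.items()}
ranking = sorted(totals, key=totals.get)

for name in ranking:
    ratio = UNCOMPRESSED_BYTES / totals[name]
    print(f"{name}: {totals[name]:,} bytes total, compression ratio {ratio:.1f}:1")
```

CMIX wins on archive size alone, but its larger decompressor moves it behind PAQ8L once both terms are counted.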

The sequences of the SARS-CoV-2 coronavirus are compressible. Further compression will require a mix of novel and creative approaches: moving beyond the state of the art of data compression or understanding the patterns and relationships within parts of sequences and between sequences.
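One simple way to exploit relationships between sequences is to prime a general-purpose compressor with a reference sequence, so that a closely related sequence is encoded mostly as back-references. The sketch below does this with the preset-dictionary (`zdict`) option of Python's `zlib` module on synthetic data; it is an assumed illustration of the idea, not a method used in the baseline measurements, and the sequences are randomly generated rather than real genomes.

```python
import random
import zlib

random.seed(0)

# Synthetic "reference" and a near-identical "variant" with a few substitutions.
reference = "".join(random.choice("ACGT") for _ in range(5000))
variant = list(reference)
for pos in (100, 1200, 2500, 3700, 4900):
    variant[pos] = "A" if variant[pos] != "A" else "C"
variant = "".join(variant).encode()
reference = reference.encode()

# Without context: the variant is compressed on its own.
plain = zlib.compress(variant, 9)

# With context: the reference is supplied as a preset dictionary.
comp = zlib.compressobj(level=9, zdict=reference)
primed = comp.compress(variant) + comp.flush()

# The decompressor needs the same reference to restore the variant losslessly.
decomp = zlib.decompressobj(zdict=reference)
restored = decomp.decompress(primed) + decomp.flush()

print(f"standalone: {len(plain)} bytes, with reference: {len(primed)} bytes")
```

The gain comes entirely from shared context, which mirrors the point that better models of how sequences relate to each other translate directly into shorter descriptions.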

The SARS-CoV-2 coronavirus data compression benchmark has a vital multidisciplinary aspect: the objective and repeatable measure in this challenge can help to align scientific breakthroughs across disciplines. At the end of the day, different theories and models to understand the coronavirus are measurable through the shortest description of the dataset.

In addition, the scientific momentum around SARS-CoV-2, and the attention paid to it, can support breakthroughs by the data compression community and advance the state of the art of compression. The techniques used for improving the compression of SARS-CoV-2 datasets can feed back into a better understanding of the underlying mechanisms of the coronavirus.

Brotli: A general-purpose data compressor

Compression of FASTQ and SAM format sequencing data

The human knowledge compression prize

Zstandard compression and the application/zstd media type

Xz utils

Issues with SARS-CoV-2 sequencing data

Compression of DNA sequences

A new challenge for compression algorithms: genetic sequences. Information Processing & Management

The minimum description length principle

A survey on data compression methods for biological sequences

The human knowledge compression prize

The UCSC genome browser database

Cmix data compression program

Three approaches to the quantitative definition of information

Sequence compression benchmark (SCB) database: a comprehensive evaluation of reference-free compressors for FASTA-formatted sequences

An introduction to Kolmogorov complexity and its applications

SARS-CoV-2 coronavirus data compression benchmark

Paq8 data compression program

Rationale for a large text compression benchmark

Identification and correction of systematic error in high-throughput sequence data

Panel: The limits of understanding

NCBI: Severe acute respiratory syndrome coronavirus 2 data hub

Modeling by shortest data description

Rar data compression program

Data compression: the complete reference

Bzip2 data compression program

Efficient DNA sequence compression with neural networks

A new coronavirus associated with human respiratory disease in China