key: cord-0329424-5z7ts1h1
authors: Hoogstrate, Youri; Jenster, Guido; van de Werken, Harmen J. G.
title: FASTAFS: file system virtualisation of random access compressed FASTA files
date: 2020-11-11
journal: bioRxiv
DOI: 10.1101/2020.11.11.377689
sha: 9417c14ede3419248677838595c35ccfe67c101c
doc_id: 329424
cord_uid: 5z7ts1h1

Background: The FASTA file format, used to store polymeric sequence data, has been a bioinformatics file standard for decades. The relatively large files require additional files, beyond the scope of the original format, to identify sequences and provide random access. Multiple compressors have been developed to archive FASTA files, but these lack direct access to targeted content or metadata of the archive. Moreover, these solutions are not directly backwards compatible with FASTA files, resulting in limited software integration.

Results: We designed a Linux-based toolkit using Filesystem in Userspace (FUSE) that virtualises the content of DNA, RNA and protein FASTA archives into the filesystem. This guarantees in-sync virtualised metadata files and offers fast random-access decompression using Zstandard (zstd). The toolkit, FASTAFS, can track all system-wide running instances, allows file integrity verification, provides instant, scriptable access to sequence files, and is easy to use and deploy.

Conclusions: FASTAFS is a user-friendly, easy-to-deploy, backwards compatible, general-purpose solution to store and access compressed FASTA files, since it offers file system access to FASTA files as well as in-sync metadata files through file virtualisation. Using virtual filesystems as an in-between layer offers the possibility to design format conversion without the need to rewrite code into different languages while preserving compatibility.

Code Availability: https://github.com/yhoogstrate/fastafs

FASTA is a file format used for storing nucleotide and amino acid polymeric sequences and is compatible with a wide variety of bioinformatics software. It is used as a database format for ribosomal RNA sequences, but also for eukaryotic reference genomes and protein databases that can be several gigabytes in size. In contrast to, for example, GenBank, it offers very limited support for metadata. Corresponding fai-index files are used to achieve random access by providing the sequence length, padding-corrected file positions, and the padding and line length. This is static information that is embedded in the FASTA file and has to be extracted after the FASTA file is generated.
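As an illustration of the static information a fai-index holds, the following minimal Python sketch derives, per sequence, the name, the length, the byte offset of the first base, the number of bases per line and the number of bytes per line. The function name `build_fai` is hypothetical, the column order follows the convention used by `samtools faidx`, and the sketch assumes that all lines of a sequence except the last have the same length. It is not the FASTAFS implementation.

```python
def build_fai(fasta_bytes: bytes):
    """Return (name, length, offset, linebases, linewidth) for each sequence."""
    entries = []
    current = None  # mutable record for the sequence being scanned
    pos = 0         # byte offset of the current line within the file
    for line in fasta_bytes.splitlines(keepends=True):
        content = line.rstrip(b"\r\n")
        if content.startswith(b">"):
            if current is not None:
                entries.append(tuple(current))
            fields = content[1:].split()
            name = fields[0].decode("ascii") if fields else ""
            # [name, length, offset of first base, bases per line, bytes per line]
            current = [name, 0, pos + len(line), 0, 0]
        elif current is not None and content:
            if current[3] == 0:  # the first sequence line defines the line geometry
                current[3], current[4] = len(content), len(line)
            current[1] += len(content)
        pos += len(line)
    if current is not None:
        entries.append(tuple(current))
    return entries


if __name__ == "__main__":
    example = b">chr1 test\nACGTAC\nGTACGT\nAC\n>chr2\nTTTT\n"
    for row in build_fai(example):
        print("\t".join(str(field) for field in row))
```

Because every value is computed from the FASTA content alone, such an index can always be regenerated, which is why it can be offered as a virtual file that stays in sync with the archive.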

Scientific demand for reproducibility and interoperability of both software applications and data is growing strongly, and as a result unique identification and data integrity play a critical role. In the CRAM data format, for instance, Next Generation Sequencing (NGS) alignments are compressed relative to a reference sequence. In this format, the reference sequences are addressed using their unique identifier for interoperability. With the identifier, the corresponding sequence can be obtained directly using the online European Nucleotide Archive (ENA) service (https://www.ebi.ac.uk/ena/cram/swagger-ui.html), preserving the intrinsic link between the data file and the reference sequences. Because real-time computation of identifiers can be computationally expensive, they are stored in separate dictionary files (*.dict). Dict-files are, like fai-index files, beyond the scope of the original file format and have to be generated and maintained after obtaining the FASTA file.

Current software applications make use of FASTA files as input in two different manners:

• First, a tool reads a FASTA text file sequentially and in one direction, starting with the first character in the file. Examples are short-read alignment algorithms, but also motif-scanners that iteratively search for a given motif [1] [...]

• Second, a tool reads a FASTA file in a random-access fashion by starting at an arbitrary location in the file and has the possibility to make jumps, forwards but also backwards, through the file. The precise file coordinates are typically calculated using the fai-index file. For example, a request to a genomic region within a genome browser is such a random-access request, since a next query can be expected at any genomic location. If the underlying FASTA file access does not support jumping through a file, it is necessary to copy the file entirely into memory. This procedure is extremely resource intensive and can slow a process significantly. Bioinformatics tools that rely [...]

[...] workaround to avoid file duplication is to make use of (named) pipes [10]. A pipe is a virtual, one-directional data stream that stays idle as long as no further data requests come in. This could, e.g., be the output of a decompressor. This is resource efficient as data access is chunked, but is not a generic solution as it does not offer random access. Access to FASTA archives in a random-access use case requires an available compression API that supports random access explicitly. If these conditions are not met, the primary goal of compression is in practice lost: the FASTA file is still needed, and having both the original and its compressed equivalent effectively costs more space than it saves.
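The file coordinate calculation mentioned above for random-access requests can be sketched as follows. Given the fai-index values of a sequence, the byte position of a base is the offset of the first base, plus one full line width for every complete line that precedes it, plus its position within the line. The function names `fasta_offset` and `fetch` are hypothetical, and the 0-based, half-open coordinates are an assumption made for this illustration.

```python
import io


def fasta_offset(offset: int, linebases: int, linewidth: int, pos: int) -> int:
    """Byte position in the file of the 0-based sequence position `pos`."""
    return offset + (pos // linebases) * linewidth + pos % linebases


def fetch(handle, entry, start: int, end: int) -> str:
    """Read the 0-based, half-open region [start, end) from a seekable handle.

    `entry` is (length, offset, linebases, linewidth) as found in a fai-index.
    """
    length, offset, linebases, linewidth = entry
    end = min(end, length)
    if start < 0 or start >= end:
        return ""
    first = fasta_offset(offset, linebases, linewidth, start)
    last = fasta_offset(offset, linebases, linewidth, end - 1)
    handle.seek(first)  # the jump that a one-directional pipe cannot perform
    raw = handle.read(last - first + 1)
    # Drop the line terminators that fall inside the requested byte range.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


if __name__ == "__main__":
    fasta = io.BytesIO(b">chr1 test\nACGTAC\nGTACGT\nAC\n")
    print(fetch(fasta, (14, 11, 6, 7), 4, 9))
```

Only the bytes covering the requested region are read, which is what makes this access pattern cheap on a seekable file and impossible on a pipe.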

Currently available bioinformatics applications that make use of FASTA files in a random-access setting mostly support only FASTA files and no compressed equivalents. Therefore, it is in practice necessary to keep a flat copy of a FASTA file with the corresponding fai-index file. For systems limited to applications with streaming access to FASTA files, a decompression binary in combination with (named) pipes is an ideal way to use FASTA archives, although it requires management of metadata files. Instead of using a classical file converter binary for decompression, we can also use file virtualisation. This way, file virtualisation functions as a layer between a compressed archive and the virtually mounted FASTA plus metadata files, which offers multiple advantages over classical (de-)compression binaries:

• Virtual files and their system calls are identical to flat file system calls. For tools that are only compatible with FASTA files, this preserves backwards compatibility, also for random-access use cases.

• There is no need to use additional disk space for temporary decompression and no need to read entire FASTA files into memory. [...]

[...] encoding followed by generic compressor Zstandard (zstd), but it lacks random access. Given that NAF achieves high compression ratios [10], FASTAFS was designed in a somewhat similar fashion, as it first compresses sequence data to a lower bit encoding (2-bit, 4-bit or 5-bit), followed by the random-access implementation of zstd called zstd-seekable. [...]

In addition, FASTAFS provides filesystem access to query partial sequences using a subsequence identifier as filename in the 'seq' subdirectory. For example, the file <mountpoint>/seq/chr1:10-20 contains only the sequence of this region, without additional characters such as newlines or spaces. Similarly, requesting the file size of <mountpoint>/seq/chr1 will provide its size in nucleotides.

Indeed, these additional features do not solve backwards compatibility issues, but they provide virtualised random access without using the fai-index file, by functioning as a programming-language-independent API implemented at the filesystem level.
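Because this interface consists of ordinary files, it can be used from any language with plain file operations. The Python sketch below shows how the 'seq' subdirectory described above might be addressed. The helper names and the example mount point `/mnt/hg19` are hypothetical, the region is passed through exactly as given (the sketch makes no assumption about the coordinate convention), and a mounted FASTAFS archive is required for the two functions that touch the filesystem.

```python
import os


def region_path(mountpoint: str, name: str, start: int, end: int) -> str:
    """Build the virtual file name for a subsequence, e.g. .../seq/chr1:10-20."""
    return os.path.join(mountpoint, "seq", f"{name}:{start}-{end}")


def read_region(mountpoint: str, name: str, start: int, end: int) -> str:
    """Read a subsequence; the virtual file holds only sequence characters."""
    with open(region_path(mountpoint, name, start, end), "rb") as handle:
        return handle.read().decode("ascii")


def sequence_length(mountpoint: str, name: str) -> int:
    """The size of the virtual file <mountpoint>/seq/<name> is its length in residues."""
    return os.path.getsize(os.path.join(mountpoint, "seq", name))


if __name__ == "__main__":
    # Hypothetical mount point; only the path is built here, nothing is read.
    print(region_path("/mnt/hg19", "chr1", 10, 20))
```

No FASTA parser or fai-index handling is needed on the client side; the filesystem layer performs the lookup.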

List: The 'fastafs list' command gives an overview of the FASTAFS archives, their aliases, number of sequences, format, compression ratio and all active mount points (Figure S1A).

View: Besides mounting, the FASTA contents can be decompressed to stdout using 'fastafs view', for which the padding can be set to a desired value and masking can be virtually disabled. The contents can also be exported to UCSC TwoBit format (Figure S1B).

Info: The 'fastafs info' subcommand gives information about the file layout, sequence size, the per-sequence MD5 checksum and the compression type used. This subcommand can also be used to query the European Nucleotide Archive (ENA) [15] to verify the existence of a sequence MD5 checksum (Figure S1C).
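A per-sequence MD5 checksum of the kind reported above can be sketched in a few lines of Python. The normalisation used here (whitespace removed, letters upper-cased) follows the convention of the SAM/CRAM reference checksum and is an assumption for this illustration; the text does not state how FASTAFS normalises sequences before hashing. The function name `sequence_md5` is hypothetical.

```python
import hashlib


def sequence_md5(sequence: str) -> str:
    """MD5 hex digest of a sequence after removing whitespace and upper-casing."""
    normalised = "".join(sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Line breaks and soft-masking (lower case) do not change the checksum.
    print(sequence_md5("acgtac\nGTACGT\nac") == sequence_md5("ACGTACGTACGTAC"))
```

Since line padding and masking do not affect the digest under this normalisation, the same identifier is obtained regardless of how the FASTA file was formatted.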

Check: The 'fastafs check' command checks the file integrity using a CRC32 checksum. Integrity of compressed sequence data blocks can be checked separately using their MD5 checksums with the '--md5' argument (Figure S1D). [...]

The FASTA file format is used to store biological polymeric sequence data in an easy-to-use format that has become a file standard in bioinformatics. Static information is embedded within each file, but needs to be extracted and stored in additional files to complement the FASTA file. We have developed a method, FASTAFS, to virtualise FASTA files along with their metadata files into the file system. The implementation makes use of the zstd-seekable compression library, which makes random access to the virtual FASTA files possible. FASTAFS comes with a feature-rich toolkit that can manage the archives, their locations and their file integrity, and provides file access in a manner backwards compatible with regular FASTA file access. This allows the archives to be used in existing software without the need for adaptation for compatibility and without the use of additional APIs. [...]

The layout of the FASTAFS format consists of four blocks, starting with the file header, followed by the per-sequence data, the per-sequence header data and a metadata block. The file header has a file pointer to the per-sequence header block, where each sequence has a file pointer to its data. The file ends with a metadata block, currently supporting a CRC32 checksum. The raw FASTAFS file is subsequently compressed with zstd-seekable. The full specification is available on the website: [...]

The 'fastafs ps' command can be used to retrieve all running instances of FASTAFS with the corresponding process IDs and mount points.
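The idea of a trailing CRC32 checksum, as used by the metadata block and by 'fastafs check', can be illustrated with the Python sketch below. The byte layout is hypothetical (a payload followed by a 4-byte big-endian CRC32 of that payload) and is not the FASTAFS specification, which defines its own header, data and metadata blocks. The function names are likewise only illustrative.

```python
import zlib


def append_crc32(payload: bytes) -> bytes:
    """Return the payload followed by its CRC32 as 4 big-endian bytes."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify_crc32(blob: bytes) -> bool:
    """Recompute the CRC32 of the payload and compare it with the stored value."""
    if len(blob) < 4:
        return False
    payload, stored = blob[:-4], int.from_bytes(blob[-4:], "big")
    return zlib.crc32(payload) == stored


if __name__ == "__main__":
    archive = append_crc32(b"example payload")
    corrupted = b"E" + archive[1:]  # alter the first payload byte
    print(verify_crc32(archive), verify_crc32(corrupted))
```

A CRC32 is cheap to compute and detects accidental corruption, whereas the per-sequence MD5 checksums additionally tie each sequence to an external identifier.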

Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities

SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data

STAR: Ultrafast universal RNA-seq aligner

The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote

JBrowse: a dynamic web platform for genome visualization and analysis

VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing

SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation

The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data

Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences

MFCompress: a compression tool for FASTA and multi-FASTA data

DSRC 2-Industry-oriented compression of FASTQ files

CRAM format specification (version 3.0: 2fcaab6)

Sequence Alignment/Map Format Specification (version 1.6: f2a6b99)

CRAM reference registry

MiRBase: Annotating high confidence microRNAs using deep sequencing data

Snakemake-a scalable bioinformatics workflow engine

Nextflow enables reproducible computational workflows

Tabix: fast retrieval of sequence features from generic TAB-delimited files

The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools

UniProt: The universal protein knowledgebase

[...] DNA (Coliphage phi-X174: NC_001422) and RNA viruses (SARS-CoV-2: NC_045512.2), databases with small RNAs (miRBase and tRNAs [...]