Mohideen, Rahul Mohideen Kaja; Peter, Pascal; Weickert, Joachim
A Systematic Evaluation of Coding Strategies for Sparse Binary Images
2020-10-26

Inpainting-based compression represents an image in terms of a sparse subset of its pixel data. Storing the carefully optimised positions of the known data creates a lossless compression problem on sparse and often scattered binary images. This central issue is crucial for the performance of such codecs. Since it has received only little attention in the literature, we conduct the first systematic investigation of this problem so far. To this end, we first review and compare a wide range of existing methods from image compression and general purpose coding in terms of their coding efficiency and runtime. Afterwards, an ablation study enables us to identify and isolate the most useful components of existing methods. With context mixing, we combine those ingredients into new codecs that offer either better compression ratios or a more favourable trade-off between speed and performance.

Traditional lossy image compression methods such as the JPEG [1] family of codecs or HEVC [2] are transform-based: They compress images by applying a cosine or wavelet transform and quantising the coefficients coarsely, which constitutes a loss of information in the frequency domain. In recent years, inpainting-based codecs [3-9] have challenged this concept. They discard information directly in the spatial domain by representing the images through a sparse subset of pixels, also known as the inpainting mask or, in short, the mask. For decoding, they reconstruct the discarded information through inpainting. Especially for high compression ratios, they can outperform transform-based codecs. However, these approaches require careful and unconstrained optimisation of both the positions and the values of the known data [4, 10-14]. This creates a delicate compression problem: In general, storing optimised mask location data is expensive from an information-theoretical perspective. One can argue that data optimised for quality contain more information about the signal, which in turn gives them a higher entropy and therefore makes them more expensive to store [15]. Additionally, any deviations, e.g. by inexact but cheaper positions, can reduce the reconstruction quality during decompression. While deviations in the pixel values are feasible and many inpainting operators are robust under quantisation [6, 7, 9], storing pixel locations is a much more sensitive task. Storing pixel locations is equivalent to storing a sparse binary image. Depending on the inpainting operator, the distribution of these points aligns with image structures or might lack obvious patterns (see Fig. 3). Most inpainting operators perform best for unconstrained, optimised data that are stored losslessly [7]. Consequently, many optimisation methods [4, 10-14] would benefit from lossless codecs specifically aimed at optimised inpainting masks. Even beyond inpainting-based compression, unconstrained mask codecs could be useful for other applications such as the storage of sparse image features (e.g. SIFT [16], SURF [17]). However, only a small number of publications [5, 7, 8] directly address the lossless compression of unconstrained masks.
To close this gap, we offer the first systematic review and in-depth analysis of suitable compression techniques from different fields and propose best practice recommendations for the lossless coding of unconstrained masks. Our contributions are threefold: We start with a systematic review of suitable coding strategies from different fields of compression. We cover dedicated positional data compression methods [5, 8], image compression codecs (JBIG [18], PNG [19], JBIG2 [20], DjVu [21], BPG lossless [22]), and state-of-the-art general purpose encoders (PAQ, LPAQ [23]). In addition, we discuss how these approaches can be applied to different representations of the sparse binary images, such as run-length encoding and methods from data science like the coordinate list (COO) [24] and compressed sparse row (CSR) [24] formats. Afterwards, we perform an ablation study of context mixing methods. Such approaches combine predictions from multiple sources of already encoded information to predict the remaining file content, thus increasing compression efficiency. In a top-down approach, we evaluate which parts of complex state-of-the-art methods like LPAQ2 [25] are useful. Moreover, we use a bottom-up strategy to construct a novel context mixing method that is simple, yet effective. Finally, we provide a methodical comparison of all existing and newly proposed methods from the previous two contributions w.r.t. compression efficiency and runtime.

In our experiments, we choose the Kodak image dataset [26], which is widely used in compression studies. In order to obtain meaningful results, we consider different densities of known data (1% to 10%) and different point distributions. We take into account random masks, results from probabilistic sparsification [11] with homogeneous diffusion [27], and densification [28] with Shepard interpolation [29]. In total, this yields a database of 720 diverse real-world test images on which we test all of the methods that we consider. This setup enables us to propose best practice recommendations based on our analysis. Overall, our three contributions allow us to identify the coding methods that yield the best lossless compression performance as well as the best trade-off between speed and coding efficiency.

During the last decade, inpainting-based compression has become a diverse field of research with many conceptually different approaches. While most of the existing works focus on the optimisation of known data and inpainting operators, they have also contributed some concepts for efficient storage methods. We distinguish four categories:

1) Regular grid approaches restrict the positions of sparse known data to a fixed grid. Some codecs specify only a global grid size of Cartesian [9] or hexagonal [30] grids, thus steering the density of the inpainting mask. Apart from a few exceptions such as RJIP (Regular grid codec with Joint Inpainting and Prediction) [9], most regular grid approaches for inpainting-based compression do not offer competitive performance.

2) Subdivision approaches additionally allow some adaptation to the image. Most of them use error-based refinement of the grid, splitting areas with high inpainting error into smaller subimages with a finer grid. They store these splitting decisions efficiently as a binary or quad tree. Originally, Galić et al. [3] and Livada et al. [31] employed a triangular subdivision, while later works [7, 32] use the more efficient rectangular subdivision by Schmaltz et al. [6].
3) Semantic approaches select known data directly based on image features and belong to the oldest inpainting-based compression techniques. They date back to a 1988 paper by Carlsson [33]. Contemporary semantic codecs are particularly popular for the compression of images with pronounced edges [34, 35] such as cartoons [36-39], depth maps [30, 40, 41], or flow fields [42]. These methods extract and store image edges and use those as known data for inpainting. Since edges are connected structures, chain codes can be used for efficient encoding.

4) Unconstrained mask approaches: While there are numerous publications that deal with the optimal selection of sparse known data for inpainting [4, 11-14, 43-48], only few use them for actual compression. Hoffmann et al. [30] primarily employ a regular grid as already mentioned, but they also rely on some unconstrained mask points to further enhance the quality; these are encoded with JBIG. Demaret et al. [10] originally utilised a hybrid approach that combines a subdivision strategy with the storage of exact positions. In a later publication [5], Demaret et al. replaced it by a more efficient context-based entropy coding method. Similar approaches were taken by Peter et al. [7] and Marwood et al. [8], who either apply different contexts or additional context mixing strategies.

Figure 1: In the mask images, the black pixels mark the positions of the known image data. This figure illustrates the mask structures we get from the different approaches mentioned in Section 1.2: original image [26], regular mask, subdivision mask (R-EED [6]), edge mask (cartoon compression [38]), and unconstrained mask (sparsification [11]). Throughout the paper, our focus will be on the compression of unconstrained masks.

As discussed above, efficient compression strategies have been proposed for almost all mask types in Figure 1. Unconstrained masks are the only exception. Therefore, in the present work, we restrict ourselves to the compression of unconstrained masks, as they are the hardest to compress of all the listed types of inpainting masks.

In Section 2, we give a short overview of inpainting-based compression, which establishes the foundations for our systematic analysis. We introduce several alternative approaches to represent point locations in Section 3. We then review different entropy coding methods in Section 4. In Section 5, we compare different popular image compression codecs. We investigate the successful ideas of context mixing in detail with an ablation study in Section 6. Furthermore, to give best practice recommendations, we perform a consolidated comparison of the best methods from the previous sections in Section 7. Finally, we discuss our conclusions and ideas for future work in Section 8.

On the compression side, we optimise the known data by repeated inpainting, adjusting the known data such that the inpainting error is minimised. Then the known mask positions and values are compressed with an entropy coder. The compression pipelines of actual codecs are more complex and depend on the type of mask and inpainting operator, but all of them follow this general structure. The decompression side is straightforward: We only need to decode the mask locations and values and perform the inpainting. Inpainting-based compression relies on two main components: an inpainting method for reconstruction and a selection approach for the sparse known data to be stored.
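To make this general structure concrete, the following minimal sketch outlines the encoding and decoding steps. All function names (select_mask, inpaint, entropy_encode, entropy_decode) are placeholders for the concrete choices discussed in the later sections; this is an illustration of the pipeline under those assumptions, not the implementation used in our experiments.

def compress(image, density, select_mask, inpaint, entropy_encode):
    # 1. Optimise the known data: choose the mask positions for the given density.
    mask = select_mask(image, density, inpaint)     # sparse binary image, 1 = known pixel
    values = image[mask == 1]                       # grey values at the known positions
    # 2. Store the positions (a sparse binary image) and the values losslessly.
    return entropy_encode(mask), entropy_encode(values)

def decompress(mask_code, value_code, entropy_decode, inpaint):
    # Decode the mask locations and values, then reconstruct by inpainting.
    mask = entropy_decode(mask_code)
    values = entropy_decode(value_code)
    return inpaint(mask, values)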
This can also be seen in the general structure of inpainting-based codecs given below in Figure 2.

Let f : Ω → R denote a grey value image that is only known on a subset K ⊂ Ω, also called the inpainting mask. We aim to reconstruct the missing values in Ω \ K by solving for the steady state (t → ∞) of the diffusion process [49]

∂_t u = div(D ∇u)      on (Ω \ K) × (0, ∞),   (1)
u = f                  on K × (0, ∞),          (2)
n^⊤ D ∇u = 0           on ∂Ω × (0, ∞),         (3)
u(x, y, 0) = f(x, y)   for (x, y) ∈ K.         (4)

Here, u(x, y, t) is the grey value at time t and position (x, y), and n is the outer image normal. The spatial gradient operator is denoted by ∇ = (∂_x, ∂_y)^⊤, and div = ∇^⊤ is the divergence. The diffusion tensor D is a positive definite matrix which specifies the behaviour of the diffusion process. Eq. 1 models the evolution of the image over time. Eq. 2 and Eq. 3 define the boundary conditions for the diffusion process. The Dirichlet boundary conditions in Eq. 2 guarantee that known pixel values stay unmodified. Eq. 3 uses Neumann boundary conditions to avoid diffusion across the image boundaries. Finally, Eq. 4 denotes the initial condition, which assigns values to the known pixels at t = 0. The simplest choice for the diffusion tensor is D = I, where I is the identity matrix. In that case, Eq. 1 can be written as

∂_t u = Δu,

where Δ = ∂_xx + ∂_yy is the Laplacian. This equation describes homogeneous diffusion, as the propagation of information is equal in all directions at all locations [27]. There are more sophisticated choices for D which allow e.g. direction-dependent adaptation of the inpainting, as in edge-enhancing anisotropic diffusion [50]. However, a full review of inpainting operators is beyond the scope of our compression-focused contribution.

Shepard interpolation is a simple inpainting method that uses inverse distance weighting [29], i.e. pixel values are inpainted by receiving more information from closer pixel neighbours than from those further away. Consider an image f : Ω → R where the inpainting mask is given by K ⊂ Ω and Ω is the image domain. We compute the reconstruction u at a point given by the position vector p as

u(p) = ( Σ_{q ∈ K} w(p − q) f(q) ) / ( Σ_{q ∈ K} w(p − q) ).

We use a truncated Gaussian G(x) := exp(−‖x‖₂² / (2σ²)) for the weighting function w. The standard deviation σ is adapted to the mask density according to σ = sqrt((m · n) / (π |K|)), as proposed by Achanta et al. [51], where m and n are the width and height of the image, respectively. Beyond the two aforementioned types of inpainting methods, there are many more, such as exemplar-based [14, 52] or pseudodifferential inpainting [53]. While each of those methods has interesting properties, including them in our considerations would go beyond the scope of our evaluation.

While there are many different strategies to find mask points, we use two of the most popular ones from the literature.

Figure 3: Even though we start with the same image, different point selection strategies or inpainting operators can lead to very different unconstrained mask distributions (panels: original image [26], 5% random mask, 5% sparsification + homogeneous diffusion, 5% densification + Shepard interpolation).

Sparsification [11] is a top-down algorithm where we start with the original image. It relies on the main idea that pixels which are harder to reconstruct are more important. In each iteration, the algorithm randomly chooses a set of candidate pixels and removes those with the lowest per-pixel inpainting errors, which explains the probabilistic nature of the process. With the new mask, it updates the inpainting error before it proceeds to the next iteration. This way, the number of mask points is reduced until the desired density is achieved. A sketch of this procedure is given below.
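The following sketch illustrates probabilistic sparsification. For the sake of a self-contained example it uses Shepard interpolation with Gaussian weights as the inpainting operator, although our evaluation pairs sparsification with homogeneous diffusion; the candidate and removal fractions are illustrative choices, not the exact parameters from the literature.

import numpy as np

def shepard_inpaint(mask, image, sigma):
    # Shepard interpolation with Gaussian weights: every unknown pixel becomes a
    # normalised, distance-weighted average of the known pixel values.
    h, w = mask.shape
    ky, kx = np.nonzero(mask)                      # positions of the known pixels
    kv = image[ky, kx]                             # grey values at those positions
    yy, xx = np.mgrid[0:h, 0:w]
    num = np.zeros((h, w))
    den = np.zeros((h, w))
    for y, x, v in zip(ky, kx, kv):
        wgt = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2.0 * sigma ** 2))
        num += wgt * v
        den += wgt
    u = num / np.maximum(den, 1e-12)
    u[ky, kx] = kv                                 # known pixels stay unmodified
    return u

def sparsify(image, density, cand_fraction=0.02, removal_fraction=0.5):
    # Probabilistic sparsification: start with a full mask and repeatedly remove
    # the candidate pixels that are easiest to reconstruct.
    h, w = image.shape
    mask = np.ones((h, w), dtype=np.uint8)
    target = int(density * h * w)
    while mask.sum() > target:
        ky, kx = np.nonzero(mask)
        n_cand = max(1, int(cand_fraction * len(ky)))
        idx = np.random.choice(len(ky), n_cand, replace=False)
        trial = mask.copy()
        trial[ky[idx], kx[idx]] = 0                # temporarily remove all candidates
        sigma = np.sqrt(h * w / (np.pi * trial.sum()))
        rec = shepard_inpaint(trial, image, sigma)
        err = np.abs(rec[ky[idx], kx[idx]] - image[ky[idx], kx[idx]])
        n_remove = max(1, int(removal_fraction * n_cand))
        n_remove = min(n_remove, int(mask.sum()) - target)
        drop = idx[np.argsort(err)[:n_remove]]     # lowest inpainting error first
        mask[ky[drop], kx[drop]] = 0               # remove them permanently
    return mask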
Densification [28] is the bottom-up analogue of sparsification. The algorithm starts with an empty mask image, and a fixed number of mask pixels is added in every iteration until the required density is achieved. However, it has been found that densification generally performs worse than sparsification, since very little information is available in the beginning. At the same time, densification tends to perform better on noisy images, where sparsification is prone to select noisy pixels [28]. We consider these particular methods since they generate binary masks with very characteristic distributions, as seen in Fig. 3. As a worst-case scenario for encoding, we consider uniformly random distributions of mask points. On the other side of the spectrum, we use probabilistic sparsification [11] with homogeneous diffusion [27], which creates masks that are highly adaptive to image structures. Finally, masks from probabilistic densification [28] with Shepard interpolation [29] are a compromise between both, since they are not randomly distributed, but conform less to image features. Unless stated otherwise, all experiments in this paper consider average compression performance over all the aforementioned point selection methods and over all mask densities from 1% to 10%.

In the previous section, we introduced the inpainting and point selection methods which generate the binary images we intend to compress. As with any other data stream, there are many ways to represent sparse binary images, and the way in which the data are represented has an impact on the final performance of lossless compression. Since the salient information of a sparse binary image consists of its non-zero pixels, the compression ratio as a traditional metric is not the most transparent choice for our evaluation. Instead, we use bytes per mask pixel, which can be computed as the compressed file size divided by the number of non-zero pixels in the image. Such a normalisation allows us to attribute the cost to the salient image features that need to be stored. In the following, we describe and compare common representations originating from different fields of research.

A naive representation that allows us to apply many standard sequential entropy coders is the vector representation. We convert the image into a vector by traversing it row-by-row.

One of the most popular methods for the compression of sparse images is run-length encoding. It replaces sequences of identical symbols, so-called runs, by their length. For our setting of sparse binary images, run-length encoding becomes even more efficient: We assume that known values are mostly isolated and thus only have to store the runs of intermediate zeroes. Consecutive ones can be represented by a run of length zero. The image has to be scanned in some order to calculate the run-lengths. Potential options include, for example, column-by-column, row-by-row, diagonal, or zigzag scanning as done in JPEG. We found that the scanning order does not have a significant impact on the final compressed file size. Therefore, we only consider the most straightforward approach, which is a column-by-column scan.

A coordinate list (COO) [24] stores each non-zero point as a (row, column) tuple. A differential scheme on the coordinates can also be considered, where we encode the difference in position between the current point and the previous point. However, if we sort the points based on their rows and columns, we get a representation that is almost identical to run-length encoding. A small sketch of the vector, run-length, and coordinate representations is given below.
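The following minimal sketch shows how a sparse binary mask can be converted into the vector, run-length, and coordinate (COO) representations discussed above; it is meant as an illustration of the representations, not as the exact encoders used in our comparison.

import numpy as np

def to_vector(mask):
    # Row-by-row traversal of the binary image.
    return mask.flatten()

def to_run_lengths(mask):
    # Column-by-column scan: store the length of the zero run before each one.
    v = mask.flatten(order='F')              # column-by-column traversal
    ones = np.nonzero(v)[0]
    runs = np.diff(np.concatenate(([-1], ones))) - 1
    return runs                              # consecutive ones show up as runs of length 0

def to_coo(mask, differential=False):
    # Coordinate list: one (row, column) tuple per non-zero pixel.
    rows, cols = np.nonzero(mask)            # sorted by row, then column
    coords = np.stack((rows, cols), axis=1)
    if differential:
        # Optionally encode differences between consecutive coordinates.
        coords = np.diff(coords, axis=0, prepend=np.zeros((1, 2), dtype=coords.dtype))
    return coords

Decoding inverts these mappings, since the image dimensions are known to the decoder.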
As another well-known sparse image representation, CSR [24] is mainly used for fast operations on large images in scientific computing. CSR traverses the image row-by-row and stores the column positions of each non-zero element. In addition, the total number of non-zero elements encountered so far is stored when a complete row has been encoded. We make a slight modification to the basic CSR method to obtain better compression results: Instead of storing the total number of encountered non-zero elements, we only store the number of non-zero elements encountered in the previous row. This reduces the range of values that need to be stored and thus the source entropy.

In Fig. 4, we compare the compression performance of the different image representations w.r.t. density by compressing them with LPAQ. We see that the vector representation and run-length encoding are the better choices, as they clearly lead to smaller compressed files than both COO and CSR. The file size after compression depends on two aspects: the initial file size and the entropy of the file. The vectorised form has the same probability distribution as the original image. For a sparse binary image, the probability distribution is heavily skewed towards zero, yielding a low entropy. This means that the original image is highly compressible, which leads to high performance. Run-length encoding gives the shortest representation, as can be seen in Table 1, which in turn yields high compression performance, especially for lower densities. On the other hand, COO and CSR have short representations, but not as short as RLE. Moreover, their entropy is not as low as that of the vectorised form. These factors combined make COO and CSR non-viable choices for compression.

In the previous section, we discussed different ways in which binary images can be represented and their effect on the final compression performance. In this section, however, we review approaches that constitute the final step of most image compression pipelines, namely entropy coders. Entropy coders are not specialised to image data, but are still a vital part and the final step of most compression codecs, regardless of the nature of the input data. The principle of entropy coders is to convert input data into a more compact representation which can be converted back to its original form by exploiting the redundancies present in said data. The redundancy or compressibility of a source with alphabet A = {A_1, A_2, ..., A_n} and probability distribution p_1, p_2, ..., p_n is measured by the Shannon entropy [15],

H = − Σ_{i=1}^{n} p_i log₂ p_i .

Higher entropy implies that the input data has little redundancy and is harder to store. Note that for our purposes, the input alphabet depends on the chosen representation of the binary image from Section 3. For the vector representation we have A = {0, 1}, and for the other representations A = {0, 1, ..., N} is a set of integers.

Huffman coding [54] is a classical prefix-free encoding scheme that maps a binary codeword to each symbol of the alphabet. As a classical entropy coder, Huffman coding aims at distributing the code lengths in such a way that the overall coding cost is minimised. Huffman coding is simple, easy to implement, and fast. While Huffman coding has been superseded by more efficient entropy coders, it is still part of some widely used image compression codecs such as JPEG [1] and PNG [19].
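As a reference for the Huffman coding step just described, here is a minimal textbook-style sketch that derives a prefix-free code from symbol frequencies; it is a generic illustration and not the exact Huffman variant used inside PNG or JPEG.

import heapq
from collections import Counter

def huffman_codes(symbols):
    # Build a prefix-free code from the symbol frequencies.
    freq = Counter(symbols)
    if len(freq) == 1:                        # degenerate case: a single distinct symbol
        return {next(iter(freq)): '0'}
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(f, i, {s: ''}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)       # merge the two least frequent subtrees
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: '0' + code for s, code in c0.items()}
        merged.update({s: '1' + code for s, code in c1.items()})
        heapq.heappush(heap, (f0 + f1, tie, merged))
        tie += 1
    return heap[0][2]                         # symbol -> bit string

# Example: Huffman coding of the run-lengths of a sparse binary mask.
# runs = to_run_lengths(mask).tolist()
# codes = huffman_codes(runs)
# bitstream = ''.join(codes[r] for r in runs)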
In contrast to Huffman coding, arithmetic coding can encode symbols at a fractional bit cost per symbol, as it maps the whole source word directly to a binary code word. In its original form [55], arithmetic coding achieves this by an interval subdivision scheme. Since this strategy requires floating-point operations and is therefore slow and prone to rounding errors, we consider the WNC implementation of Witten, Neal and Cleary [56]. It replaces the real-valued intervals by sets of integers. This greatly improves computational performance and allows an easier generation of the binary code word. Due to its good approximation of the Shannon limit, arithmetic coding is used in many lossless and lossy image compression codecs [2, 3, 5, 6, 8, 18, 20-22] and also forms the foundation for more advanced general purpose encoders such as the context coding and context mixing methods which we discuss in the subsequent sections.

Arithmetic coding estimates the symbol probabilities only from the symbol counts, i.e. the number of times each symbol has been encountered. This pure consideration of the occurrence frequency is referred to as a zeroth-order context from which the probabilities are derived. However, more complex contexts can be considered to incorporate more structural information of the input data into the encoding process, thus allowing better compression performance. As a direct extension of the zeroth-order context, we can use a history of m previously encoded symbols to predict the next symbol. The collection of m previously encoded symbols is called the m-th order context. For images, one can also consider 2D neighbourhoods of varying size (see Fig. 6). Using a single context allows us to adapt the probabilities to recurring patterns inside the scope of the corresponding context. This enables approaches like Prediction by Partial Matching (PPM) [57], Context Adaptive Binary Arithmetic Coding (CABAC), which is used in HEVC [2] and BPG [22], and Embedded Block Coding with Optimised Truncation (EBCOT), which is used in JPEG2000. All of these offer better performance than standard arithmetic coding.

Context mixing approaches extend the idea of context coding. Instead of using only a single context to estimate the symbol probability, the probabilities derived from many contexts can be combined by weighted averaging. Starting from the pioneering work of Mahoney, the so-called PAQ1 context mixing algorithm [58], many versions with increasingly sophisticated contexts have been developed. All versions of PAQ encode bits individually rather than whole symbols and convert the input data into a bit stream accordingly. The small alphabet reduces the number of possible contexts and has the additional advantage that PAQ can be combined with fast binary arithmetic coding algorithms. As general purpose compressors, the full versions of PAQ aim at compressing mixed data such as combinations of text, audio, images, and many other data types. Thus, many of their additional contexts would not be useful for our purposes. In the following sections, we consider LPAQ2 by Rhatushnyak [25], since we identified it as the most promising member of the expansive PAQ family. LPAQ is a lightweight variant of PAQ that uses six local contexts and a prediction context. The local contexts depend on the bits already encoded in the current byte and on the previous four encoded bytes. By finding the longest repeating context, LPAQ tries to predict the next bits.
A neural network mixes the probabilities obtained from the aforementioned contexts. The neural network has a simple structure: It consists of an input layer with seven nodes and an output layer that computes the weighted average of the probabilities in the logistic domain. Such logistic mixing relies on the transformations

stretch(p) = ln( p / (1 − p) )   and   squash(x) = stretch⁻¹(x) = 1 / (1 + e^(−x)).

The algorithm learns the weight for each context during the encoding process by a modified form of gradient descent that tries to minimise the coding cost. This results in larger weights for contexts that give good predictions for the given input data. The algorithm refines this probability further using two SSE (Secondary Symbol Estimation) stages. Each SSE stage takes an input probability and then stretches and quantises it. The encoder interpolates the output using a context table to give an adjusted probability. Finally, we use this probability to encode the next bit with arithmetic coding. However, an individual context can have a different amount of influence on predicting the next bit depending on the local statistics. Therefore, using the same neural network to encode all bits can be sub-optimal. To this end, LPAQ2 maintains 80 neural networks and chooses which one to employ based on the combined match lengths of the prediction context and the already encoded bits. Variants of PAQ have been utilised successfully in many inpainting-based compression approaches, primarily for storing quantised pixel values [3, 6, 7, 30, 32]. Since the PAQ family of codecs is so successful, we dedicate Section 6 to a detailed, separate analysis of them.

In Sections 2, 3 and 4, we have covered all the ingredients needed to construct sparse binary image compression methods. But before discussing specialised approaches to compress sparse binary images, we consider well-established image compression methods to provide a performance baseline for the specialised codecs. We compare five popular lossless image compression methods: PNG [19], JBIG [18], DjVu [21], JBIG2 [20], and BPG lossless [22]. We choose PNG as it is the most widely used lossless natural image compression method, and BPG as the state of the art in natural image compression. We choose JBIG and JBIG2 as they specialise in compressing binary images, and finally DjVu as it has components that are built for binary image compression. In Fig. 5, we see that on masks derived from the Kodak dataset, PNG compresses the fastest, as it combines simple prediction schemes with Huffman coding. DjVu performs best w.r.t. file size reduction. JBIG and JBIG2 use similar methods to compress non-text data, which explains their comparable performance and compression times. Both give a good trade-off between speed and performance. BPG yields inferior results because our sparse test images violate BPG's assumptions on natural images.

Figure 6: Local contexts for our version of PAQ. X is the pixel currently being coded. The m-th order context of pixel X consists of the pixels with an index less than or equal to m, where 1 ≤ m ≤ 12.

The previous section compared several popular general-purpose image compression methods to establish a baseline. In this section, however, we look into specialised compression methods for sparse binary images. We have introduced a wide range of coding strategies that are viable candidates for good performance on binary images in Sections 4 and 5.
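To illustrate how such 2D contexts can be used in practice, the sketch below packs the already encoded causal neighbours of the current pixel into a context index that addresses a table of symbol counts. The neighbour ordering is an assumption for illustration and does not necessarily match the exact ordering of Fig. 6.

# Causal neighbour offsets (already encoded pixels in a row-by-row scan),
# ordered roughly by distance to the current pixel X; the exact ordering of
# Fig. 6 may differ.
OFFSETS = [(0, -1), (-1, 0), (-1, -1), (-1, 1), (0, -2), (-2, 0),
           (-2, -1), (-1, -2), (-2, 1), (-1, 2), (-2, -2), (-2, 2)]

def context_index(mask, y, x, order):
    # Pack the values of the first `order` causal neighbours of (y, x) into an
    # integer in [0, 2**order); out-of-bounds neighbours are treated as 0.
    h, w = mask.shape
    idx = 0
    for k, (dy, dx) in enumerate(OFFSETS[:order]):
        ny, nx = y + dy, x + dx
        bit = int(mask[ny, nx]) if 0 <= ny < h and 0 <= nx < w else 0
        idx |= bit << k
    return idx

# During encoding, counts[order][context_index(mask, y, x, order)] holds the
# pair (n0, n1) from which the conditional probability of the next bit follows.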
However, context mixing is not only the most sophisticated coding strategy, it also has a track record of many applications in inpainting-based compression [3, 6, 7, 30, 32]. Therefore, we investigate which of the many components of these complex methods are useful for our purpose. In the following, we describe our setup for a detailed ablation study of context mixing on sparse binary images. In Fig. 4, we saw that the vector and RLE representations are best suited for compression. Therefore, we pursue two approaches involving these representations: In a bottom-up strategy, we combine established as well as new contexts in a novel, minimalistic context mixing codec operating on the vector representation. Furthermore, we deconstruct LPAQ2 in a top-down approach to compress run-lengths.

The reference point that we start with is the codec of Marwood et al. [8]. It employs a global context where the probability of the next bit being a 1 is estimated as V_r / N_r, where V_r denotes the remaining number of significant pixels to be encoded and N_r the total number of remaining pixels to be coded. From now on, we refer to this probability as the Marwood context probability. We also considered the context proposed by Demaret et al. [5], which is defined by the number of non-zero pixels among the 12 already encoded neighbours of the current pixel. However, we found that even though this context performs well on its own, it does not offer any performance gain when combined with our models.

We implemented a lightweight codec, BPAQ-2D-S (Skeletal PAQ for 2D binary data), inspired by the context mixing structure of Mahoney [23], where we use only the first-order local context (the neighbouring pixel to the left) in addition to the Marwood context. The set of equations is given in Algorithm 1, where n_10 and n_11 are the counts for 0 and 1 occurring in the first-order context and w_1 is its weight. The bit currently being encoded is denoted by x. We perform a semi-stationary update of the local contexts, where we halve the count of the non-observed symbol. We update the counts this way so that the context probabilities can adapt more quickly to local statistics [23]. V_r is initialised to the number of non-zero pixels in the image and N_r to the total number of pixels in the image. When calculating p_final, we make use of a dynamically weighted local probability, a statically weighted local probability, and a statically weighted global probability. The global probability is weighted statically because counts for 0 and 1 cannot be defined for the Marwood context in the same way as for the local context. We observed experimentally that combining static and dynamic weights gives better performance than using dynamic weights alone. We determined the static weights empirically. Finally, we encode the next bit with arithmetic coding using p_final.

BPAQ-2D-M (More efficient PAQ for 2D binary data) is an extended version of BPAQ-2D-S. As shown in Fig. 6, it has twelve local contexts instead of one. Increasing the order from m to m + 1 adds one pixel to the context neighbourhood and allows better adaptivity to local statistics. The differences in the expressions relative to BPAQ-2D-S (Algorithm 1) are given in Algorithm 2, where n_i0 and n_i1 are the counts for 0 and 1 occurring in the i-th context and w_i is its corresponding weight, with 1 ≤ i ≤ 12. The static local probability now comes from the fourth-order context instead of the first-order context used in BPAQ-2D-S.
Again, it was found experimentally that the fourth-order context works best. Apart from these changes, the rest of the expressions stay the same as in BPAQ-2D-S.

BPAQ-2D-L (Logistic PAQ for 2D binary data) is similar to BPAQ-2D-M, but uses logistic mixing instead of linear mixing: the context probabilities are first stretched, and the final probability is obtained by squashing their weighted combination. Unlike BPAQ-2D-M, where the final probability is obtained by statically mixing the mixed local context probability and the Marwood context probability, here the final probability is calculated dynamically, with weights that change in each step for all contexts. In Algorithm 3, p_1, ..., p_12 are the probabilities from the 12 local contexts and p_13 is the Marwood context probability. n_i0 and n_i1 are the counts for 0 and 1 occurring in the i-th context and w_i is its weight. The bit currently being encoded is denoted by x, and the learning rate α is set to 0.02.

Algorithm 1 (BPAQ-2D-S): compute the evidence for the next bit being 0 (S_0 = ε + w_1 n_10) and for the next bit being 1, the total evidence S = S_0 + S_1, the weighted local probability that the next bit is 1, and the unweighted local probability from the first-order context; then compute the final probability p_final as described in the text.

Algorithm 2 (BPAQ-2D-M): compute the evidence for the next bit being 0 and 1 over all twelve local contexts, compute the unweighted local probability from the fourth-order context, and update the context weights; the remaining steps are as in Algorithm 1.

Finally, we have BPAQ-2D-XL (eXtended Logistic PAQ for 2D binary data), which is derived from BPAQ-2D-L; the only change is the structure of the local contexts. In the previous models, the higher-order contexts were always contiguous extensions of the lower-order ones, as seen in Fig. 6. Here, we remove the contiguity condition and generate contexts that may be disconnected. We only consider the four nearest pixels to the current pixel, and consequently we obtain four different first-order contexts, six different second-order contexts, four different third-order contexts, and one fourth-order context. Therefore, we have 15 local contexts in total. The main differences between the proposed bottom-up PAQ models are listed in Table 2.

In Fig. 7, we see that the performance mostly correlates with the complexity of our proposed models. The simplest model, BPAQ-2D-S, is the fastest one, and BPAQ-2D-XL is the slowest. Surprisingly, BPAQ-2D-L outperforms all other approaches. BPAQ-2D-M performs better w.r.t. compression ratio than BPAQ-2D-S due to the additional contexts, at the cost of speed. BPAQ-2D-L gains additional compression efficiency, but the use of logarithmic and exponential functions in logistic mixing slows the method down further. The additional contexts of BPAQ-2D-XL not only increase its complexity, but even slightly increase the cost per mask point. Thus, one should either choose BPAQ-2D-L for the lowest bit cost per mask point or the method of Demaret et al. [5] for a good mix of compression ratio and speed.
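For concreteness, here is a minimal sketch of the logistic mixing and weight update that BPAQ-2D-L performs per bit (LPAQ2 mixes in a similar way). The probability clipping is an implementation detail added here for numerical safety; the gradient-style update follows the standard PAQ form and uses the learning rate α = 0.02 mentioned above.

import math

def stretch(p):
    return math.log(p / (1.0 - p))

def squash(x):
    return 1.0 / (1.0 + math.exp(-x))

def mix(probs, weights):
    # Weighted combination of the context probabilities in the logistic domain.
    t = [stretch(min(max(p, 1e-6), 1.0 - 1e-6)) for p in probs]   # clip for safety
    p_mix = squash(sum(w * ti for w, ti in zip(weights, t)))
    return p_mix, t

def update_weights(weights, stretched, p_mix, bit, alpha=0.02):
    # Gradient-descent style update that reduces the coding cost of `bit`.
    err = bit - p_mix
    return [w + alpha * err * ti for w, ti in zip(weights, stretched)]

# Per pixel: p_mix, t = mix(context_probs, weights); encode the bit with p_mix
# via arithmetic coding; then weights = update_weights(weights, t, p_mix, bit).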
Algorithm 3 (BPAQ-2D-L): compute the context probabilities, stretch them, combine the stretched probabilities, compute the final probability by squashing, update the context weights, perform the semi-stationary update of the local context counts, and update the counts for the Marwood probability.

Table 2: Main differences between the proposed bottom-up PAQ models.
BPAQ-2D-S: uses 1 local context with linear mixing.
BPAQ-2D-M: uses 12 local contexts with linear mixing.
BPAQ-2D-L: BPAQ-2D-M, but with logistic mixing.
BPAQ-2D-XL: BPAQ-2D-L, but with non-contiguous contexts.

LPAQ2 is a very good entropy coder for a wide range of data and already constitutes a lightweight version of PAQ. However, it is unclear whether all of the individual steps of the method really contribute to its performance. In order to identify these parts, we eliminate features step by step in a top-down manner and observe the resulting performance and runtime behaviour. We start out with the full original implementation of LPAQ2 with parameter settings for highest compression (compression level 9). From there on, we perform multiple reduction steps. Step 1: We remove the prediction context (no prediction model). Step 2: We reduce the number of neural networks from 80 to a single one (single net model). Step 3: We retain only the intra-byte context and discard the rest (intra only model). Step 4: For ULPAQ (ultra lite PAQ), we also discard the second SSE stage (RLE + ULPAQ).

A comparison of the aforementioned reduction steps is shown in Fig. 8. As the most complex model, the original LPAQ is the slowest of the evaluated methods. The no prediction model and the single net model only show a minor difference in bits per mask pixel, but they are about 20% faster than the original version. This shows that a single neural network without prediction comes at virtually no disadvantage compared to the more complex full version of the codec. The intra only model and ULPAQ are even 10 times faster than the original version. Additionally, ULPAQ performs better than the intra only model, which shows that the second SSE stage is detrimental for binary input data. For reference, we also consider LPAQ level 0, the fastest parameter setting for standard LPAQ. It is not as fast as RLE + ULPAQ, but produces smaller encoded files. In practice, ULPAQ offers a slight advantage in speed, while LPAQ (level 0) provides a slight advantage in compression ratio.

In this section, we perform a consolidated comparison of the best candidates for sparse binary mask compression from the previous Sections 3, 4, 5 and 6. Our evaluation is based on the two essential mask characteristics: point distribution and density. Both of these aspects are relevant for practical use. The distribution depends on the inpainting operator and the mask selection strategy, while the density is closely connected to the desired final compression ratio of the inpainting method. Thus, any best practice recommendations need to take into account the behaviour of the algorithms w.r.t. these two factors. The different methods that we considered to generate the mask points have varying point distributions; see Fig. 3. We want to investigate whether the compression performance changes with respect to the point distribution. Here, we present the results averaged over all densities from 1% to 10% for the different point distributions. From Fig. 9, we observe that the relative rankings of the different compression methods change over the different types of distributions. For homogeneous diffusion masks, we see that DjVu offers both fast compression and a very good compression ratio.
If runtime is no concern, BPAQ-2D-L provides the lowest cost in bits per mask point. For Shepard and random masks, we see that the two best methods on homogeneous diffusion masks, BPAQ-2D-L and DjVu, perform worse. This is due to the fact that masks derived from homogeneous diffusion are more structured, so both BPAQ-2D-L and DjVu can benefit from their ability to detect such patterns. This structure is absent from the masks derived with Shepard interpolation and, in the extreme case, from completely random mask distributions. This is also why all codecs perform better on homogeneous diffusion masks than on Shepard and random masks. In such cases, the approach of Demaret et al. [5] is a good alternative.

Next, we evaluate whether the relative performance of the selected codecs differs for varying densities. We present the results averaged over all point distributions for masks with densities of 1%, 5%, and 10%. In addition, we also consider an average over all point distributions and all point densities between 1% and 10%. Fig. 10 shows that the relative rankings between the codecs are consistent over increasing densities and also in the average over all densities. This implies that the choice of the binary mask compression algorithm only has to be adapted to the distribution, but not to the compression ratio. For time-critical applications, RLE + ULPAQ is the most suitable combination. BPAQ-2D-L, from our ablation study of context mixing, yields the best compression performance on average, at the cost of runtime. Between those two extremes, the method of Demaret et al. provides a good balance between speed and performance. Additionally, a general trend can be inferred from Fig. 4: The relative cost of storing positions declines for denser masks. This shows that storing mask positions decreases the compression performance by a larger amount for sparse masks than for dense masks.

The first systematic study of sparse binary image compression allows us to understand, compare, and improve upon a wide variety of codecs. With an ablation study, we have determined the aspects of context mixing methods that make them successful for the compression of sparse binary images. As a consequence of the ablation study, we proposed two different classes of methods: one to compress the binary image itself and one to compress run-lengths.

Figure 10: Performance vs. time comparison of different coders tested on different densities (1%, 5%, and 10% masks, as well as averaged over 1% to 10% masks), averaged over masks derived from the Kodak image dataset using sparsification + homogeneous diffusion, densification + Shepard interpolation, and random masks.

Our best practice recommendations provide a foundation for future inpainting-based codecs that use unconstrained masks. Our results from Section 7 suggest that optimal mask compression should adapt to the specific point distribution caused by the combination of inpainting operator and mask selection method. However, the compression method does not need to be changed based on the density, since the density does not impact the relative rankings of the codecs. Based on our full evaluation, we have concrete recommendations for different use cases: RLE + ULPAQ for time-critical applications, BPAQ-2D-L for the highest amount of compression for structured masks, and the codec of Demaret et al. [5] for the compression of unstructured or random masks. The approach of Demaret et al. can also be used to compress structured masks to obtain a good compromise between compression performance and speed.
These results have interesting implications for future codecs that use unconstrained masks for inpainting-based compression. Due to the increased relative cost of storing very sparse unconstrained masks, we can infer that such masks are only viable when they offer a substantial improvement in inpainting quality over regular or structured masks. It might also be interesting to consider codecs that select unconstrained masks by taking into account not only the inpainting quality, but also the cost of storing them.

References
[1] JPEG: Still Image Data Compression Standard.
[2] Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology.
[3] Image compression with anisotropic diffusion.
[4] How to choose interpolation data in images.
[5] Contextual image compression from adaptive sparse data representations.
[6] Understanding, optimising, and extending data compression with anisotropic diffusion.
[7] Evaluating the true potential of diffusion-based inpainting in a compression context.
[8] Representing images in 200 bytes: Compression via triangulation.
[9] Fast inpainting-based compression: Combining Shepard interpolation with joint inpainting and prediction.
[10] Image compression by linear splines over adaptive triangulations.
[11] Optimising spatial and tonal data for homogeneous diffusion inpainting.
[12] An optimal control approach to find sparse data for Laplace interpolation.
[13] A bi-level view of inpainting-based image compression.
[14] Advanced Concepts for Intelligent Vision Systems.
[15] The Mathematical Theory of Communication.
[16] Object recognition from local scale-invariant features.
[17] Speeded-up robust features (SURF).
[18] Information technology - progressive lossy/lossless coding of bi-level images. Standard, International Organization for Standardization.
[19] PNG (Portable Network Graphics) specification, version 1.0. RFC 2083, status: informational.
[20] The emerging JBIG2 standard.
[21] High quality document image compression with DjVu.
[22] Proceedings of the 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference.
[23] Adaptive Weighing of Context Models for Lossless Data Compression.
[24] Iterative Methods for Sparse Linear Systems.
[25] LPAQ2 (A. Rhatushnyak).
[26] Kodak true color image suite.
[27] Basic theory of pattern observation.
[28] Denoising by inpainting.
[29] A two-dimensional interpolation function for irregularly-spaced data.
[30] Compression of depth maps with segment-based homogeneous diffusion.
[31] EEDC image compression using Burrows-Wheeler data modeling.
[32] Turning diffusion-based image colorization into efficient color compression.
[33] Sketch based coding of grey level images.
[34] Image coding using weak membrane model of images.
[35] Segmentation-based non-texture image compression framework using anisotropic diffusion models.
[36] Two image compression schemes based on image inpainting.
[37] Image compression based on spatial redundancy removal and image inpainting.
[38] Edge-based compression of cartoon-like images with homogeneous diffusion.
[39] Image compression based on PDEs.
[40] Efficient depth map compression based on lossless edge coding and diffusion.
[41] A scalable coding approach for high quality depth image compression.
[42] Compressing flow fields with edge-aware homogeneous diffusion inpainting.
[43] Optimising spatial and tonal data for PDE-based inpainting.
[44] Discrete Green's functions for harmonic and biharmonic inpainting with sparse atoms.
[45] Algorithms for piecewise constant signal approximations.
[46] On high-resolution adaptive sampling of deterministic signals.
[47] Genetic normalized convolution.
[48] On the convergence of a linesearch based proximal-gradient method for nonconvex optimization.
[49] Anisotropic Diffusion in Image Processing.
[50] Visualization and Processing of Tensor Fields.
[51] Extreme image completion.
[52] Exemplar-based interpolation of sparsely sampled images.
[53] Pseudodifferential inpainting: The missing link between PDE- and RBF-based interpolation.
[54] A method for the construction of minimum redundancy codes.
[55] Arithmetic coding. In: Introduction to Data Compression.
[56] Arithmetic coding for data compression.
[57] Data compression using adaptive coding and partial string matching.
[58] The PAQ1 Data Compression Program.