Sequence-specific minimizers via polar sets

Hongyu Zheng1, Carl Kingsford1, and Guillaume Marçais∗1
1Computational Biology Department, Carnegie Mellon University, Pittsburgh, USA
February 10, 2021

Abstract

Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset.

1 Introduction

Minimizer methods (Roberts et al., 2004a,b), also known as winnowing (Schleimer et al., 2003), sample positions or k-mers (substrings of length k) from a long string.
Thanks to their versatility, these methods are used in many bioinformatics programs to reduce memory requirements and computational resources. Read mappers (Li and Birol, 2018; Jain et al., 2020b,a), k-mer counters (Erbert et al., 2017; Deorowicz et al., 2015), genome assemblers (Ye et al., 2012; Chikhi et al., 2015) and many more (see Marçais et al. (2019) for a review) use minimizers. In most cases, sampling the smallest number of positions, as long as the string is roughly uniformly sampled, is desirable, as it leads to sparser data structures or less computation because fewer k-mers need to be processed. Minimizers have such a guarantee of approximate uniform sampling: given the parameters w and k, a minimizer selects at least one k-mer in every window of w consecutive k-mers. It achieves this goal by selecting the smallest k-mer (the “minimizer”) in every w-long window, where smallest is defined by a choice of an order O on the k-mers. Even though every minimizer scheme satisfies the constraint above, depending on the choice of the order O the total number of selected k-mers may vary significantly. Consequently, research on minimizers has focused on finding orders O that obtain the lowest possible density, where the density is defined as the number of selected k-mers over the length of the sequence. In particular, most research concentrates on the average case: what is the lowest expected density given a long random input sequence (Marçais et al., 2017, 2018; Ekim et al., 2020; Orenstein et al., 2016)? In practice, many tools use a “random minimizer” where the order is defined by choosing at random a permutation of all the k-mers (e.g., by using a hash function on the k-mers). This choice has the advantage of being simple to implement and providing good performance on the average case.

∗To whom correspondence should be addressed. gmarcais@cs.cmu.edu

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. This version posted February 10, 2021; doi: https://doi.org/10.1101/2021.02.01.429246.

Here we investigate a different setup that is common in bioinformatics applications. Instead of the average density over a random input, we try to optimize the density for one particular string or sequence. When applying minimizers in computational genomics, in many scenarios the sequence is known well in advance and it does not change very often. For example, a read aligner may align reads repeatedly against the same reference genome (e.g., the human reference genome). In such cases, optimizing the density on this specific sequence is more meaningful than on a random sequence. Moreover, the human genome has markedly different properties than a random sequence, and optimization for the average case may not carry over to this specific sequence. In the read aligner example, a minimizer with lower density leads to a smaller index to save on disk and fewer seeds to consider in the seed-and-extend alignment algorithm, while preserving the same sensitivity thanks to the approximate uniform sampling property. The idea of constructing sequence sketches tailored to a specific sequence has been explored before (Chikhi et al., 2015; DeBlasio et al., 2019; Jain et al., 2020b), but it remains less understood than the average case. Random sequences have nice properties that allow for simplified probabilistic analysis. Consequently, different analytic tools are needed to analyze sequence-specific minimizers.
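Throughout, the baseline is the random minimizer just described. A minimal sketch, assuming nothing beyond the definition above: Python's built-in `hash` merely stands in for a random order on k-mers (its salt varies per process), and `random_minimizer_positions` and `seq` are our illustrative names, not the paper's implementation.

```python
def random_minimizer_positions(s, w, k, order=hash):
    """Positions selected on s by the minimizer (w, k, order): in every
    window of w consecutive k-mers, pick the smallest k-mer under `order`,
    breaking ties by preferring the leftmost position."""
    kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first index achieving the minimum,
        # which implements the leftmost tie-breaking rule.
        best = min(range(w), key=lambda j: order(window[j]))
        selected.add(start + best)
    return sorted(selected)

seq = "ACGTACGTTAGCCATGGATCAAGT"
pos = random_minimizer_positions(seq, w=4, k=3)
# Specific density: selected positions over the number of k-mers.
density = len(pos) / (len(seq) - 3 + 1)
```

By construction every window of w consecutive k-mers contains a selected position, so the density always falls between 1/w and 1.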
In fact, minimizers designed to have low density in the average case often offer only modest improvements on sequences of interest such as reference genomes (Zheng et al., 2020a). The current theory for minimizers with low density in the average case is tightly linked to the theory of universal hitting sets (UHS) (Orenstein et al., 2016; Marçais et al., 2018; Kempa and Kociumaka, 2019). As the name suggests, a UHS is a set of k-mers that “hits” every w-long window of every possible sequence (hence the universality; it is an unavoidable set of k-mers). Universal hitting sets of small size generate minimizers with a provable upper bound on their density. Universal hitting sets are less useful in the sequence-specific case: the requirement to hit every window of every sequence is too strong, and UHSs are too large to provide a meaningful upper bound on the density for a specific sequence. New theoretical tools are needed to analyze the sequence-specific case. Frequency-based orders are examples of sequence-specific minimizers (Chikhi et al., 2015; Jain et al., 2020b). In these constructions, k-mers that occur less frequently in the sequence compare less than k-mers that occur more frequently. The intuition is to select rare k-mers, as they should be spread apart in the sequence, hence giving a sparse sampling. This intuition is only partially correct. First, there is no theoretical guarantee that a frequency-based order gives low-density minimizers, and there are many theoretical counter-examples. Second, in practice, frequency-based orders often give minimizers with lower density, but not always. For example, Winnowmap (Jain et al., 2020b) uses a two-tier classification (very frequent vs. less frequent k-mers) because it performs better than an order strictly following frequency of occurrence.
Another approach to sequence-specific minimizers is to start from a UHS U and to remove as many k-mers from U as possible, as long as the set still hits every w-long window of the sequence of interest (DeBlasio et al., 2019). Because this procedure starts with a UHS that is not related to the sequence, the amount of possible improvement in density is limited. Additionally, given the exponential growth in size of the UHS with k, current methods are computationally limited to k ≤ 15, which is restrictive in many applications. The construction proposed here takes a different approach and introduces polar sets. The polar set concept can be seen as complementary to universal hitting sets: while a UHS is a set of k-mers that intersects with every w-long window at least once, a polar set is a set of k-mers that intersects with any window at most once. The name “polar set” is an analogy to a set of polar opposite magnets that cannot be too close to one another. That is, our construction builds upon sets of k-mers that are sparse in the sequence of interest, and consequently the minimizers derived from these polar sets have provably tight bounds on their density. Our main contribution is Theorem 1, which gives an upper bound and a lower bound on the density obtained by a minimizer created from a polar set. These bounds are expressed in terms of the “total link energy” of the polar set on the given sequence. The link energy is a new concept that measures how well spread apart the elements of the polar set are in the sequence: the higher the energy, the more spread apart the k-mers are. We then show that the link energy is almost exactly the improvement in density one gains from using a minimizer created from the polar set compared to a random minimizer. In the following sections we also show that the problem of finding a polar set with maximum total link energy is, unsurprisingly, NP-hard, and we describe a heuristic to create polar sets with high total link energy.
Finally, we show that our implementation of this heuristic generates minimizers whose specific density on the human reference genome is much lower than that of any previous method, and, for some parameter choices, relatively close to the theoretical minimum.

2 Methods

2.1 Overview

We set the stage by defining important terms and concepts, then give an overview of the main results, which are then proved formally in the following sections. The sequence S is a string on the alphabet Σ of size σ = |Σ|. The parameters k and w define respectively the length of the k-mers and the window size. We assume that S is relatively long compared to these parameters: |S| ≫ w + k.

Definition 1 (Minimizer and Windows). A minimizer is characterized by (w, k, O) where O is a complete order of Σ^k. A window is a sequence of length (w + k − 1) consisting of exactly w k-mers. Given a window as input, the minimizer outputs the location of the smallest k-mer according to O, breaking ties by preferring the leftmost k-mer.

The minimizer (w, k, O) is applied to the sequence S by finding the position of the smallest k-mer in every window of S. Because two consecutive windows in S have a large overlap, the same k-mer is often selected in both windows, hence the minimizer returns a sampling of positions in the sequence S. The specific density of the minimizer on S is defined as the number of selected positions over the length |S|. The density is between 1/w, because at least one k-mer in every window must be picked, and 1, because it is a sampling of the positions of S.
Therefore the goal is to find orders O with density as close to 1/w as possible. A minimizer with density 1/w is a perfect minimizer. For simplicity, when stating the density of a minimizer we ignore any additive term that is o(1/w) (i.e., asymptotically negligible compared to 1/w). A random minimizer is defined by choosing at random one of the permutations of all k-mers. The expected density of a random minimizer is 2/(w + 1) (Schleimer et al., 2003; Roberts et al., 2004b; Zheng et al., 2020a). Equivalently, the expected distance between adjacent selected k-mers is (w + 1)/2. Random minimizers serve as our baseline for comparison.

Defining orders. For practical reasons, we define orders by specifying a set U and considering orders that are compatible with U: an order O is compatible with U if, for O, every element of U compares less than any element not in U. That is, only the smallest elements for O are specified (the elements of U), and a minimizer using an order compatible with U will preferentially select the elements of U. There exist many orders compatible with U, as the relative order between the elements within U is not specified.

Universal Hitting Sets. A set U is a universal hitting set if every one of the σ^(w+k−1) possible windows (recall σ is the size of the alphabet) contains a k-mer from U. In the average case, minimizers compatible with U have densities upper bounded by |U|/σ^k, because only k-mers from the universal hitting set can be selected. Supplementary Section S2 provides a more detailed discussion of why this bound provided by universal hitting sets does not always apply in sequence-specific minimizer analysis, and why universal hitting sets do not specialize well.

Short sequences. On a short random sequence (in a sense made precise by Lemma 1) most k-mers are unique (i.e., they occur only once in the sequence S).
Therefore, it is likely that there is a set U of unique k-mers of S that are exactly w bases apart in S, and a minimizer compatible with U is perfect. Unfortunately, most sequences of interest (e.g., reference genomes) are too long and too repetitive, and in general do not satisfy the hypothesis of Lemma 1. For most sequences it is not possible to find a set of “perfect seeds” of k-mers spaced exactly w apart.

Polar sets. A polar set is a relaxed version of a perfect set: any pair of k-mers m1 and m2 from a polar set A are always more than w/2 bases apart in S (see the more general Definition 2). The intuition behind this definition is that for a minimizer compatible with A, any k-mer from A selected by the minimizer is at distance ≥ (w + 1)/2 from the previous and the next selected k-mers. Hence, in expectation, k-mers selected from A are at least as sparse as, and usually sparser than, k-mers selected by a random minimizer. Section 2.2 gives a formal definition of the link energy of a polar set, and Theorem 1 gives upper and lower bounds, in terms of this link energy, on the density of a minimizer compatible with a polar set. This theorem shows that the link energy of the polar set A measures how much reduction in density is obtained by using a minimizer compatible with A rather than a random minimizer. Hence, designing a polar set with high link energy is a way to find minimizers with provably low density. Section 2.3 introduces layered polar sets, an extension of polar sets, and builds a heuristic method to create such sets.
2.2 Polar sets and link energy

2.2.1 Key Definitions

Figure 1: Comparing universal hitting sets (all windows contain at least one UHS k-mer; some windows have multiple), perfect seeds (all windows contain exactly one seed k-mer; compatible minimizers become perfect minimizers) and polar k-mers with slackness s = 0 (all windows contain at most one polar k-mer; some windows have none). Each block indicates a k-mer, and each segment indicates a window of length 5 (w = 5). To provide a better contrast with universal hitting sets, we show polar sets with slackness s = 0 (see Definition 2).

We now define polar sets, the key component of our proposed methods.

Definition 2 (Polar set). Given a sequence S and parameters (w, k, s) with 0 ≤ s < 1/2, a polar set A of slackness s is a set of k-mers such that every two k-mers in A appear at least (1 − s)w bases apart in S.

This can be viewed as an idea complementary to universal hitting sets, or as a relaxed form of perfect sets. As discussed in the introduction, a universal hitting set is required to hit every w consecutive k-mers at least once, while a polar set with s = 0 hits every w consecutive k-mers at most once. A set of perfect seeds, if it exists, is both a polar set with zero slackness and a universal hitting set. See Figure 1 for a more concrete example. The condition s < 1/2 is critical for our analysis. Specifically, this condition is required to obtain a lower bound on the specific density of compatible minimizers, not just an upper bound.

Definition 3 (Link energy). Given a sequence S, parameters (w, k) and a polar set A, if two k-mers on S are l ≤ w bases apart and are both in A, the link energy of the pair is defined as 2l/(w + 1) − 1 ≥ 0. The total link energy of A is the sum of the link energy across all eligible pairs.
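As an illustration of Definitions 2 and 3, the following sketch checks the polar set condition and sums the link energy on a toy sequence. The helper names (`occurrences`, `total_link_energy`) are ours, and the brute-force occurrence scan stands in for the suffix-array machinery used later in the paper.

```python
def occurrences(s, kmer):
    # All start positions of kmer in s, overlapping occurrences included.
    return [i for i in range(len(s) - len(kmer) + 1) if s[i:i + len(kmer)] == kmer]

def total_link_energy(s, polar, w, slack=0.0):
    """Verify the polar set condition (Definition 2), then sum the link
    energy 2l/(w+1) - 1 over adjacent occurrences l <= w apart (Definition 3).
    Checking adjacent occurrences suffices: a pair with a third polar
    occurrence between them cannot form a link."""
    pos = sorted(p for m in polar for p in occurrences(s, m))
    for a, b in zip(pos, pos[1:]):
        if b - a < (1 - slack) * w:
            raise ValueError("polar set condition violated")
    return sum(2 * (b - a) / (w + 1) - 1
               for a, b in zip(pos, pos[1:]) if b - a <= w)

# Toy example: "ACG" and "TAG" occur once each, 4 bases apart (w = 4),
# giving a single link with energy 2*4/5 - 1 = 0.6.
energy = total_link_energy("ACGTTAGTT", {"ACG", "TAG"}, w=4)
```

With slackness 0, every eligible pair is exactly w apart and contributes 1 − 2/(w + 1), matching the discussion that follows.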
Any two k-mers from A in S must be more than w/2 bases apart, so two k-mers cannot form a link if there is a third k-mer from A between them. With s = 0, the link energy is fixed at 2w/(w + 1) − 1 = 1 − 2/(w + 1) ≈ 1 for each eligible pair, and the total link energy is approximately the number of pairs that form a link, which in turn is the number of k-mer pairs in the polar set that are exactly w bases apart on S. In the following sections, we introduce and discuss the backbone of the polar set framework, which revolves around a closer inspection of how a random minimizer works on a specific sequence, and a contrast between sequence-specific and non-sequence-specific minimizers. We use the term “non-sequence-specific minimizers” to refer to minimizer constructions that do not target a specific sequence, but rather aim to minimize density, the expected specific density on a random string.

2.2.2 Perfect Minimizers for Short Sequences

A perfect minimizer is a minimizer that achieves density of exactly 1/w. While the only known examples of perfect minimizers are in the asymptotic case where w ≫ k (Marçais et al., 2018), perfect sequence-specific minimizers exist with high probability for short sequences.

Lemma 1. If |S| < (1 − ε)√(wσ^k), a random sequence of length |S| has a perfect minimizer with probability at least ε.

Proof. The optimal minimizer is constructed with fixed-interval sampling.
More specifically, we take every w-th k-mer in S and denote the resulting k-mer set U, then construct a minimizer compatible with U. The resulting minimizer is perfect if and only if the k-mers in U appear only at the selected locations. There are |S|/w selected locations and (1 − 1/w)|S| locations not selected, and for each pair of selected and non-selected locations, the k-mers at these two locations are identical with probability σ^(−k) (see Supplementary Section S1). By the union bound, the probability that the sequence violates this condition is at most |S|²σ^(−k)/w < (1 − ε)², and the sequence has a perfect minimizer with probability at least 1 − (1 − ε)² > ε.

2.2.3 Context Energy and Energy Savers

Contexts provide an alternative way to measure the density of a minimizer (Zheng et al., 2020a). They play a central role in the analysis of polar sets.

Definition 4 (Charged Contexts). A context of S is a substring of length (w + k), or equivalently (w + 1) consecutive k-mers, or equivalently 2 consecutive windows. A context is charged if the minimizer selects a different k-mer in the first window than in the second window.

See the top left of Figure 2 for examples of charged contexts. Intuitively, a charged context corresponds to the event that a new k-mer is picked, and counting picked k-mers is equivalent to counting charged contexts.

Lemma 2 (Specific Density by Charged Contexts). For a given sequence S and a minimizer, the number of locations selected by the minimizer equals the number of charged contexts plus 1.

Given a context c, define E(c) as the probability that c is charged under a random minimizer (one with a random ordering of k-mers), which we call the energy of c.

Lemma 3. The expected number of picked k-mers in S under a random minimizer is 1 + E0(S), where E0(S) = ∑_c E(c) is called the initial energy of S and the summation is over every context of S.

This is proved by combining the linearity of expectation and Lemma 2.
This implies that the total energy of a sequence is directly related to the specific density of random minimizers, which is the number of picked locations in S divided by the number of k-mers in S. E(c) admits a simple formula:

Lemma 4. E(c) = 2/u(c) if the last k-mer in the context is unique, and 1/u(c) otherwise, where u(c) denotes the number of unique k-mers in c.

Proof. Consider an imaginary minimizer with w′ = w + 1 and identical k. The context of a (w, k)-minimizer is a window of the imaginary minimizer, and it is charged if and only if the imaginary minimizer picks either the first or the last k-mer. If the imaginary minimizer picks neither end, the two constituent windows of the context share the same minimal k-mer, and the context is not charged. With a random minimizer, the probability that the first k-mer is picked in the imaginary window is 1/u(c). The probability that the last k-mer is picked is 1/u(c) if the last k-mer is unique, and 0 otherwise, because the minimizer breaks ties by preferring the leftmost k-mer. The two events are mutually exclusive, so E(c) is the sum of these two terms.

If all k-mers in a context are unique, E(c) = 2/(w + 1) is guaranteed, which we call the baseline. If this holds for all contexts, a random minimizer will have specific density of 2/(w + 1), similar to applying a random minimizer to a random sequence. As lower u(c) only increases E(c), E(c) < 2/(w + 1) holds only if the last k-mer in c is not unique and there are over (w + 1)/2 unique k-mers in the context.

Definition 5.
A context c is called an energy saver if E(c) < 2/(w + 1), and its energy deficit is defined as 2/(w + 1) − E(c). The energy deficit of S, denoted D(S), is the total energy deficit across all energy savers: D(S) = ∑_c max(0, 2/(w + 1) − E(c)).

In general, the value of D(S) is very small, because energy saver contexts (those with E(c) < 2/(w + 1)) are rare.

Lemma 5. For a random context, the probability that it is an energy saver is at most wσ^(−k).

Proof. We bound the probability that the last k-mer in a context is not unique. The probability that the last k-mer equals the k-mer at a specific other location is σ^(−k) (see Supplementary Section S1). Applying the union bound over the w other k-mers (as each context has (w + 1) k-mers) yields the desired result.

There are examples of sequences where energy saver contexts are abundant. An extreme scenario is a sequence S with period w containing w distinct k-mers: in this case, all contexts are energy savers. Such scenarios are rare in practice. Similarly, we define energy spenders and energy surplus as follows:

Definition 6. A context c is called an energy spender if E(c) > 2/(w + 1), and its surplus is defined as E(c) − 2/(w + 1). The energy surplus of S, denoted X(S), is the total energy surplus across all energy spenders: X(S) = ∑_c max(0, E(c) − 2/(w + 1)).

Contexts with energy surpluses are more common than energy savers, but still fairly rare in a random sequence with suitable choices of w and k:

Lemma 6. For a random context, the probability that it is an energy spender is at most w(w + 1)σ^(−k)/2.

Proof. A context is an energy spender only if the last k-mer is unique and some k-mer appears twice. We bound the probability that some k-mer in the context appears twice. Following previous arguments, any two k-mers in a given context are identical with probability σ^(−k), and we apply a union bound of size w(w + 1)/2 (enumerating over pairs of k-mers) to obtain the desired result.
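For small inputs, Lemma 4 and Definitions 5 and 6 can be computed directly. The sketch below is our illustration, not the paper's code; it takes u(c) to be the number of distinct k-mers in the context, with "unique" meaning a k-mer that occurs exactly once in c.

```python
def context_energy(c, k):
    """E(c) from Lemma 4: 2/u(c) if the last k-mer of context c is unique
    within c, and 1/u(c) otherwise, with u(c) the number of distinct
    k-mers in c."""
    kmers = [c[i:i + k] for i in range(len(c) - k + 1)]
    u = len(set(kmers))
    return (2 if kmers.count(kmers[-1]) == 1 else 1) / u

def deficit_and_surplus(s, w, k):
    """D(S) and X(S) (Definitions 5 and 6), summed over all contexts,
    i.e. all substrings of S of length w + k."""
    base = 2 / (w + 1)
    d = x = 0.0
    for i in range(len(s) - (w + k) + 1):
        e = context_energy(s[i:i + w + k], k)
        d += max(0.0, base - e)  # energy saver contribution
        x += max(0.0, e - base)  # energy spender contribution
    return d, x
```

For example, with w = 3 and k = 3, a context of 4 distinct k-mers sits exactly at the baseline 2/(w + 1) = 0.5, contributing to neither D(S) nor X(S).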
2.2.4 Density Bounds with Polar Sets

With the proper tools in place, we now state the main theorem of polar sets.

Theorem 1. Given a sequence S and a polar set A on S, let E0(S) be the initial energy of S, D(S) the total energy deficit, X(S) the total energy surplus, and L(S, A) the total link energy of the polar set. The number of selected k-mers over S for a random minimizer compatible with A is at most 1 + E0(S) + D(S) − L(S, A), and at least 1 + E0(S) − X(S) − L(S, A).

Proof. We first prove the upper bound part. We start by elevating the energy of every energy saver context to the baseline 2/(w + 1). By definition, this increases the total energy of S by D(S), so the number of selected k-mers is now upper bounded by 1 + E0(S) + D(S); formally, 1 + ∑_c max(E(c), 2/(w + 1)) = 1 + E0(S) + D(S).

Figure 2: Charged contexts and polar k-mers. A context consists of two windows and is charged if a different k-mer is selected in each of the two windows. Contexts without polar k-mers (singleton: 2 contexts charged, 6 covered, L = 0); contexts with polar k-mers (linked polar k-mers: 2 + 2 = 4 contexts charged, 13 covered, L = 1/3).

For a ballpark estimate, we assume S is a random sequence and set the slackness parameter s = 0 in the construction of the polar set. In this setup, each link has exactly 1 − 2/(w + 1) ≈ 1 energy. As seen in Lemma 5, a context is an energy saver with probability at most wσ^(−k), and its deficit is at most 2/(w + 1) − 1/w ≈ 1/w, meaning D(S) ≈ σ^(−k)|S|.
This further means we need the number of links to be at least σ^(−k)|S| to provably beat a random minimizer. On the other hand, ignoring the effect of D(S), in order to beat the specific density of a random minimizer by ε/(w + 1), a total link energy of ε|S|/(w + 1) is needed. Assuming no slackness, this means the number of links needs to be at least ε|S|/(w − 1). Intuitively, an ε fraction of the sequence needs to be covered by links between close-enough k-mers in the polar set. A proper polar set requires s < 1/2 for the main theorem to hold. When s ≥ 1/2, only the upper bound part of the theorem holds, with an alternative definition of link energy. We discuss this alternative definition in Section 2.3.4, and further discuss generalizations of polar sets in Supplementary Section S2.4.

2.2.5 Hardness of Optimizing Polar Sets

The link energy formulation of polar sets allows us to cast the problem in a graph-theoretical framework. Consider an undirected, weighted graph where every unique k-mer is a vertex. An edge connects two k-mers as follows: if the two k-mers ever appear within fewer than (1 − s)w bases of each other in S, the weight is −∞. Otherwise, the weight of the edge is the total link energy obtained by selecting only these two k-mers, which may comprise several links, as each k-mer may appear in S multiple times. There can also be weighted self-loops, as a k-mer may appear close to itself on the reference sequence. The problem of finding optimal polar sets then becomes the problem of finding a maximum-weight induced subgraph. The general maximum induced subgraph problem is well known to be NP-hard via reduction from maximum clique. In Supplementary Section S3, we provide an explicit proof that optimizing polar sets, even over an alphabet of size three, is NP-hard.

2.3 Constructing Polar Sets

In this section, we propose a practical extension of polar sets and formally introduce our heuristics.
Figure 3: Examples of layered polar sets, with three layers. Without layering, the k-mers from layers 2 and 3 could not be included in the polar set because of self-collisions. The whole sequence is covered in this case (every window contains a polar k-mer from one layer). Layer 1 is the layer with highest priority, and our layered heuristics construct it first.

2.3.1 Layered Polar Sets

Assume we have already constructed a polar set A that covers some segments of the reference sequence. Here, covered means that every window contains a k-mer from the polar set, or equivalently, A acts as a universal hitting set on these segments. Now, to cover the rest of the reference, we would like to extend A so that more k-mers become polar k-mers. It is natural to consider generating a polar set over the uncovered portion of the reference sequence, then merging this set with A. This, however, leads to problems. Let A′ be a polar set over the uncovered portion of the reference sequence. A ∪ A′ might not always be a valid polar set, because a k-mer m′ ∈ A′ may appear in the already-covered part of the reference sequence, close to another k-mer m ∈ A, thus violating the polar set condition for A ∪ A′. On the other hand, the reason we set up the constraint for polar sets is to ensure that k-mers in the polar set will always be selected by any compatible minimizer. In other words, we want to know exactly the set of k-mers that will be selected.
The issue was that m′ ∈ A′ might not always be selected by a compatible minimizer. However, from the perspective of constructing efficient minimizers, we do not need m′ to be selected everywhere, as in some places the reference sequence is already covered by k-mers in A. By forcing m < m′ for every m ∈ A, we ensure that m′ will only be selected outside the segments covered by A. Applying this argument to all k-mers in A′, we can essentially ignore the sequence segments already covered by A when constructing A′, as long as the ordering is satisfied. This gives a way to progressively construct the layers of polar sets: at each layer we only need to consider regions of the reference sequence that are not yet covered by previous layers. Formally:

Definition 7. A layered polar set is a list of sets of k-mers {Ai}, for 1 ≤ i ≤ m. With slackness s, the layered polar set condition is satisfied if for any k-mer in Aj, for each of its appearances at location t in the reference sequence, either of the following holds:

• It is at least (1 − s)w bases apart from any k-mer in {A1, A2, · · · , Aj}.

• It is covered: there are two k-mers in {A1, A2, · · · , Aj−1} (importantly, this does not include Aj), appearing at locations l and h, satisfying l < t < h and h − l ≤ w.

Similarly, a compatible order for {Ai} is an order that places all k-mers from A1 first in arbitrary order, then those in A2, ..., then those in Am, and finally those not in any of {Ai} in a random order. The link energy L({Ai}, S) is similarly defined over the pairs of close k-mer appearances that are not covered. More formally:
Definition 8. For a layered polar set, if two k-mers in the layered polar sets, not necessarily from the same layer, appear l ≤ w bases apart in S, and neither is covered, the link energy between them is 2l/(w + 1) − 1 > 0. L({Ai}, S) is the total link energy across all such pairs.

These definitions of layered polar sets and link energy have two important properties. First, the link energy is non-decreasing as more layers are added to the set. Second, an almost identical argument proves the same bounds for layered polar sets as for polar sets in Theorem 1. See Figure 3 for a concrete example of layered polar sets.

2.3.2 Polar Set Heuristic

We consider a simple heuristic to generate a polar set. The core idea is to select as many k-mers as possible from the set of k-mers that appear exactly w bases away from each other. We cannot select all of them, as doing so may violate the polar set condition when some k-mers appear multiple times. Because reference sequences are long strings (in the range of billions of bases for mammalian genomes), we consider algorithms that scale well with the length of the reference sequence, preferably close to linear.

Fixing an offset o ∈ [0, w − 1], we start by listing all locations t such that t = o mod w in the reference sequence S. We then randomly shuffle the locations, and for each location t in this random order, add the k-mer at location t to the polar set. When we add a k-mer m to the polar set, we also locate and remove all k-mers in the polar set that appear fewer than (1 − s)w bases away from m. Additionally, if a k-mer appears multiple times in the list, it is considered only once, at the first encounter.
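The core loop of this heuristic can be sketched as follows. This is a minimal illustration under simplifying assumptions: it replaces the suffix array with a plain dictionary of occurrence lists, scans for conflicts naively instead of using the linked blocks described later, and omits the frequency filter; the function name `polar_set_round` is ours.

```python
import random
from collections import defaultdict

def polar_set_round(S, w, k, s, seed=0):
    """One round of the (non-monotonic) polar set heuristic, simplified.

    Returns a set of k-mers such that any two occurrences of selected
    k-mers in S are at least (1 - s) * w bases apart."""
    rng = random.Random(seed)
    # Occurrence lists stand in for the suffix array lookups of the paper.
    occ = defaultdict(list)
    for t in range(len(S) - k + 1):
        occ[S[t:t + k]].append(t)

    o = rng.randrange(w)                         # random offset
    starts = list(range(o, len(S) - k + 1, w))   # locations t = o (mod w)
    rng.shuffle(starts)

    min_gap = (1 - s) * w
    selected = {}        # location -> selected k-mer at that location
    processed = set()
    for t in starts:
        m = S[t:t + k]
        if m in processed:
            continue     # each k-mer is considered once, at first encounter
        processed.add(m)
        locs = occ[m]    # sorted occurrence list of m
        # Filtering: a k-mer colliding with its own copies cannot be polar.
        if any(b - a < min_gap for a, b in zip(locs, locs[1:])):
            continue
        # Remove previously selected k-mers that conflict with any
        # occurrence of m (naive scan; the paper uses linked blocks).
        conflicts = {selected[u] for u in selected
                     if any(abs(u - v) < min_gap for v in locs)}
        for u in [u for u, km in selected.items() if km in conflicts]:
            del selected[u]
        for v in locs:
            selected[v] = m
    return set(selected.values())
```

The naive conflict scan makes this sketch quadratic in the worst case; the linked blocks of Section 2.3.4 reduce it to a constant number of block queries per occurrence.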
This is to prioritize k-mers that appear less often: frequent k-mers are expected to be processed early given their multiple occurrences, and are more likely to be absent from the final polar set as they have more chances to be removed due to conflicts. Jain et al. (2020b) explored a similar idea in building tiered random minimizers using a biased hash function.

Our algorithm also has a variant, which we call "monotonic". In this variant, we require that adding a new k-mer m and removing the k-mers conflicting with m actually increases the link energy. Otherwise, the k-mer is skipped and no conflicting k-mers are removed. This variant is slower but results in more efficient polar sets.

We filter k-mers before they are considered for addition to polar sets. A k-mer that collides with itself (appears fewer than (1 − s)w bases away from its own copy) cannot be in the polar set. We also filter out k-mers by their frequency in the reference sequence (see Section 2.3.3 for the threshold value).

Algorithm 1 Pseudocode for Polar Set Heuristics
function PolarSet(S, w, k)
    Start with an empty set A ← {} and a random offset o
    Shuffle the list of locations t = o mod w for 0 ≤ t < |S|
    for each t in the list, with k-mer mt at location t do
        Skip if mt is filtered, or has been processed previously
        Obtain the list l of occurrences of mt via the suffix array
        Obtain the list of conflicting k-mers via linked blocks
        Remove all conflicting k-mers and add mt to the polar set A
    end for
    return A
end function

Algorithm 1 shows the pseudocode for the non-monotonic variant of the heuristic. The monotonic variant is similar. We describe the data structures in Section 2.3.4, and analyze the time complexity in Section 2.3.5.

2.3.3 Layered Heuristics and Hyperparameters

We construct layered polar sets with a similar algorithm. The properties of layered polar sets guarantee that new layers cannot decrease the final link energy of the polar set.
We rerun the polar set heuristic multiple times, each time with a new random offset o. Each round is run against the current layers of polar sets, and the resulting polar set is added as a new layer. The algorithm for each layer is mostly identical to the single-layer version, with a few changes.

• When processing a k-mer, we skip all of its occurrences that are covered by existing layers of the polar set.

• We skip k-mers at non-covered locations t that are fewer than (1 − s)w bases away from a k-mer in a previous layer. These k-mers cannot be in the layer without violating the layered polar set condition.

• At the end of each round, we remove all k-mers selected in the current layer that do not form a link with any other k-mer.

We also gradually increase the k-mer frequency threshold at each round to prioritize low-frequency k-mers. In our experiments, we use a total of 7 rounds, with the last two rounds being monotonic. The frequency threshold is set so as to include 85% of the locations of the reference in the first round, gradually increasing to 95% in the last round.

The slackness s is also a tunable parameter, which determines when a pair of k-mers is considered in collision. A lower value of s ensures that the distance between adjacent polar k-mers is large, yielding higher link energy for every pair of linked k-mers, but results in a smaller number of selected k-mers, implying fewer links.
A higher value of s means larger polar sets covering more of the reference sequence and more links formed, but adjacent polar k-mers may be closer to each other, resulting in lower energy per link. In our experiments, we use a fixed slackness s = 0.4 after a parameter search. This results in approximately 20% less efficient links (average link energy compared to the theoretical maximum), but higher total link energy due to the inclusion of more links. A more thorough parameter tuning might suggest a gradually increasing value of s between rounds.

2.3.4 Supporting Data Structures

Our heuristics require some data structures to operate efficiently both in theory and in practice.

Suffix Array. In order to quickly index k-mers and obtain the list of occurrences of a k-mer, we precompute the suffix array, the inverse suffix array and the heights table (also known as the LCP array) of the reference sequence. All can be computed in linear time. This allows us to find the list of T locations that share the same k-mer as location t in O(T) time.

Linked Blocks. The layered polar set property ensures that in any stretch of w/2 bases, at most one location is selected into any layer of the layered polar sets, excluding covered locations. We use a data structure called linked blocks to represent the set of these selected locations. Let h = ⌊w/2⌋; we divide the locations in the reference sequence into h-long blocks, and use an array of length |S|/h to represent these blocks. Each value C[b] in the array is either −1, meaning there is no selected location within the block spanning locations [bh, (b + 1)h), or a nonnegative integer j, indicating that the k-mer at location bh + j is selected. With linked blocks we can do the following operation quickly:

Definition 9. PeekL(x) returns the closest selected location to the left of x, up to w bases away.

This takes constant time because we only need to query up to three blocks.
Adding or removing a location also involves only a single block. Similarly, we can define PeekR(x). With this data structure, we can implement many critical operations in the aforementioned heuristics. The filtering step, more specifically determining whether a k-mer collides with itself, is done using this data structure, in a fashion similar to bucket sorting. By maintaining two linked blocks, one for the current layer and one for all previous layers, we can determine whether a location is covered by the previous layers, and list collisions on the current layer.

Calculating Link Energy. In the monotonic variant of our heuristics, we need to calculate the total link energy before and after adding a k-mer. In our implementation, we update the link energy of the polar set as we add and remove locations in the linked blocks, using the following alternative formula for link energy: L({Ai}, S) = 2Acov/(w + 1) − Aele − Aseg. Here, Acov is the number of contexts that contain a k-mer from the polar set, Aele is the number of non-covered locations of selected k-mers, and Aseg is the number of contiguous segments of windows that contain a k-mer from the polar set. When adding or removing a location in the linked blocks, the changes to these three values are computed using linked block primitives in constant time, so we can maintain the link energy at constant overhead.
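The two forms of the link energy can be cross-checked on a toy example. This is a standalone sketch with our own function names, assuming each selected location lies in exactly w + 1 contexts and that links form between adjacent non-covered selected locations; the real implementation instead updates the three counters incrementally through the linked blocks.

```python
def link_energy_pairwise(locs, w):
    """Link energy per Definition 8: each adjacent pair of selected
    locations at distance l <= w contributes 2*l/(w + 1) - 1."""
    locs = sorted(locs)
    return sum(2 * (b - a) / (w + 1) - 1
               for a, b in zip(locs, locs[1:]) if b - a <= w)

def link_energy_counters(locs, w):
    """The alternative formula L = 2*A_cov/(w + 1) - A_ele - A_seg."""
    covered = set()  # indices of contexts containing a selected location
    for p in locs:
        covered.update(range(p - w, p + 1))  # the w + 1 contexts containing p
    a_cov = len(covered)
    a_ele = len(locs)                        # non-covered selected locations
    a_seg = sum(1 for j in covered if j - 1 not in covered)  # contiguous runs
    return 2 * a_cov / (w + 1) - a_ele - a_seg
```

With w = 5 and selected locations 10, 14 and 17 (gaps of 4 and 3), both functions return 1/3, with the counters coming out to A_cov = 13, A_ele = 3, A_seg = 1; an isolated location yields zero energy under both forms.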
As a sanity check, observe that when adding an isolated k-mer, Acov increases by (w + 1) and the other two values each increase by 1, resulting in a net link energy gain of zero, consistent with the original definition. We can also compute the link energy of the polar k-mers in the bottom part of Figure 3 using this formula, where Acov = 13, Aele = 3 and Aseg = 1, resulting in a total link energy of 1/3.

2.3.5 Time Complexity Analysis

We now analyze the time complexity of the layered polar set heuristic, assuming no monotonic rounds for now. Let n be the length of the reference sequence, and assume a constant-sized alphabet. We assume a word of constant size can hold an integer in [0, n], and that accessing an element in an array of length n takes constant time. These conditions hold for genomes and 64-bit machines. This means the primitive operations on linked blocks take constant time, and operations involving the suffix array also take constant time.

Consider a worst-case scenario: by iterating over k-mers that appear exactly w bases away from each other, we iterate over all k-mers in the reference sequence. Assume a k-mer m occurs T times in the reference sequence. In the filtering phase, we first fetch the list of T locations in O(T) time using the suffix array, and we want to determine if there are two elements whose difference is less than (1 − s)w. This can be done using the linked blocks in O(T) time. In the case of layered polar sets, we also want to determine if each of the locations is covered by previous layers, and if it is fewer than (1 − s)w bases away from a location in a previous layer. As we use one linked block for all previous layers, this can also be done in O(T) time. The filtering phase thus finishes in O(T) time.

The main algorithm is split into three parts: detecting k-mers that are close to m in the reference sequence, removing those k-mers from the polar set, and adding m to the polar set.
Detecting and listing k-mers that are close to m takes O(T) time, as each location reports at most four collisions, two to the left and two to the right. Removing a k-mer that occurs T′ times takes O(T′) time, but since each k-mer is added and removed at most once in one round, this amortizes to O(T) time. Adding m to the polar set also takes O(T) time. The singleton detection step (removing k-mers forming no links) likewise takes O(T) time to check whether m is a singleton. As each k-mer is visited only once in the main algorithm, and in the worst case every k-mer in S is visited, we conclude that the layered polar set heuristic runs in ∑ O(T) = O(n) time per layer; as a special case, the (non-layered) polar set heuristic runs in O(n) time. The monotonic variant of the heuristic can in theory take O(n^2) time, but it is not significantly slower in practice.

3 Results

All the experiments are run on the human reference genome hg38. To facilitate performance comparison across a range of parameter values of w and k, we report the density factor (Marçais et al., 2017) instead of the density. The density factor is the density multiplied by (w + 1). Regardless of the value of w, the random minimizer has an expected density factor of 2 and a perfect minimizer has a density factor of ≈ 1.

3.1 Energy Deficit and Energy Surplus

First, we calculate the average energy deficit X(S)/|S| and average energy surplus D(S)/|S|. The results are in Figure 4A.
[Figure 4 panels: A, energy surplus and deficit; B, density factor; each plotted against k (15 to 25) for w = 10 and w = 100.]

Figure 4: Left: Energy surplus and deficit for short (w = 10) and long (w = 100) windows, computed on the human reference sequence hg38. The difference between the two lines is the difference between the upper and lower bounds of Theorem 1. It is very small, so the bounds are very good estimates in practice. Right: Density factor for the proposed methods (Random Minimizers, Fixed Interval Sampling, Miniception, Layered Polar Sets), for short and long windows, computed on hg38. The bottom orange dashed line is the theoretical minimum density (perfect minimizers).

The reference genome is more repetitive than a purely random sequence. However, empirically the energy surplus and deficit are still small, well below 0.01 measured in density factor, implying a relative error of at most 1% when estimating specific density with link energy. Thus, when constructing efficient minimizers via (layered) polar sets, using link energy to estimate specific density is both efficient and accurate. For reference, on a random sequence the average energy surplus and deficit are below 10^-7 in absolute value for the parameter range we are interested in.

3.2 Evaluating Polar Set Heuristics

We next evaluate our proposed algorithms for layered polar sets. We implemented the algorithm in Python 3.
Experiments are run in parallel, and the longest ones finish within a day. The peak memory usage stands at 100 GB, which occurs at startup when loading the precomputed suffix array via Python pickle. We compare our results against several alternatives:

• Random Minimizers. Achieve a density factor of 2 both in theory and in practice, as indicated in the last section.

• Lower Bound. This corresponds to the density factor of perfect minimizers. While our theory predicts the existence of perfect minimizers matching the lower bound for large values of k, this rarely happens with practical parameter values.

• Fixed Interval Sampling. This method uses every w-th k-mer from S as the set U to define a compatible minimizer.

• The Miniception (Zheng et al., 2020a), a practical algorithm that provably achieves lower density in many scenarios. The hyperparameter k0 is set to max(5, k − w) in our experiments.

We do not include existing algorithms for constructing compact universal hitting sets, because these methods do not scale to values of k > 14. Our heuristics work best when k-mers do not appear too frequently, or roughly speaking, when σ^k > n where n is the length of the reference sequence; this choice of parameters is common in bioinformatics analyses. With a sequence the size of the human reference genome, our heuristics work well starting at k = 15. Additionally, the Miniception achieves performance comparable to leading UHS-based heuristics, so it also serves as a viable proxy.

We consider two scenarios, first with short windows (w = 10) and second with long windows (w = 100). The results are shown in Figure 4B. Our experiments indicate that our simple heuristics yield efficient minimizers, greatly outperforming random minimizers and the Miniception, while maintaining a consistent edge over fixed interval sampling, in both the short-window and long-window settings.
The improvement is more pronounced when the windows are long. Given that our layered polar set heuristics consist of multiple rounds, in Supplementary Section S5.1 we show the progression of density factors through the rounds, demonstrating that the layered heuristics are particularly effective at low values of k.

We next show that in building sequence-specific minimizers using layered polar sets, we do not sacrifice their performance in the general case, measured by (expected) density. In Supplementary Section S5.2, we sketch a random sequence using the sequence-specific minimizers we built for hg38. As expected, the performance closely matches that of a random minimizer.

4 Discussion

4.1 Limits and Future of Polar Sets

While the concept of polar sets is interesting and leads to improvements in state-of-the-art sequence-specific minimizer design, we should acknowledge its limitations. First, it cannot be used to design non-sequence-specific minimizers when w > k. Arguably, this means the method is more tailored for sequence-specific minimizers. See Supplementary Section S4 for a proof and more discussion on non-sequence-specific polar sets.

Our experimental results show that the performance of minimizers based on polar sets greatly improves as k grows. When each k-mer appears many times in the reference sequence, it becomes hard to select many k-mers without violating the polar set condition.
For comparison, in Supplementary Section S5.3 we show the results of applying the heuristics to the human chromosome 1 sequence only, which is about 1/10 as long as the whole human reference genome. We observe improvements across the board for the heuristic algorithms and for fixed interval sampling. The repetitiveness of the human reference genome also makes optimizing specific density much more difficult. In Supplementary Section S5.4, we show the results of applying the heuristics to build sequence-specific minimizers on a random sequence that is as long as the chr1 sequence. It is significantly easier to reach the theoretical minimum specific density of 1/w in this setup than in the previous one.

With better computing power and more efficient algorithms, it would be desirable to compute an optimal polar set. Thanks to our link energy formulation, the problem of finding an optimal polar set can be formulated as an integer linear program (ILP), with each k-mer a binary variable. For moderately sized reference sequences, an optimal polar set can be found this way. However, no such convenient formulation exists for layered polar sets, and it is an interesting open question whether there is a tractable optimization problem for minimizers in general.

4.2 Practicality of Sketches-by-Optimization

Polar sets can, in most cases, be used wherever universal hitting sets are used. Given that our heuristics for layered polar sets only produce a small number of layers, implementing a compatible minimizer with layered polar sets is not fundamentally different from doing so with a universal hitting set. The fixed interval sampling method is very similar to previously proposed methods (Khiste and Ilie, 2015; Almutairy and Torng, 2018; Frith et al., 2020), where the sketch of a reference sequence is simply the set of k-mers appearing at locations divisible by w.
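For concreteness, this baseline sketch is trivial to compute (a minimal sketch; the name `fixed_interval_sketch` is ours):

```python
def fixed_interval_sketch(S, w, k):
    """Fixed interval sampling: the set of k-mers starting at every
    location of S divisible by w."""
    return {S[t:t + k] for t in range(0, len(S) - k + 1, w)}
```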
Polar sets might not directly replace fixed interval sampling; however, they can be readily expanded into a set of seeds that covers the whole reference sequence.

These approaches are currently relatively underused compared to more traditional minimizers, such as lexicographic or random orders and slight variants of either. A significant reason for their unpopularity is that using these methods requires looking up a table of k-mers, be it a set of polar k-mers or universal hitting k-mers, for every k-mer in the query sequence. In contrast, for a random minimizer implemented with a hash function, no lookup is required during sketch generation. Since these lookup tables are usually the result of sequence-specific optimization, we say these methods fall into the category of "sketches-by-optimization". This contrast leads to interesting tradeoffs in efficiency. For example, using a polar-set-compatible minimizer generates a more compact sequence sketch, but might take more time at query time than a random minimizer, due to the time spent loading and querying the set of polar k-mers.

We believe better implementations of k-mer lookup tables and better optimization of sequence sketches, possibly jointly, will popularize sketches-by-optimization. Existing methods already take steps towards this goal. Jain et al. (2020b) use a compact lookup table to index frequent k-mers, and Liu et al. (2019) use a Bloom filter to perform approximate queries over fixed interval samples.
Techniques like k-mer Bloom filters (Pellow et al., 2017) might further help performance.

4.3 Alternative Measurements of Efficiency

Throughout this manuscript, our goal has been the optimization of specific density. Low density results in smaller sequence sketches, which is desirable for many applications. However, depending on how one uses the sequence sketch, alternative measurements of efficiency may be preferable (see also the discussion in Edgar (2020)). For example, in k-mer counting, minimizers are used to place k-mers into buckets. In this case, the specific density is less relevant, and we are more concerned with the number of buckets and the load balance between them (Marçais et al., 2017; Nyström-Persson et al., 2020). For read mapping, smaller sequence sketches have their own advantages, while some may prefer reducing the number of matches, or reducing false positive seed matches in general. We believe many of these objectives are correlated with each other, and we are interested both in further exploring the benefits of a small sequence sketch and in optimization techniques for alternative measurements of efficiency.

5 Conclusion

Inspired by deficiencies in the current theory and practice of sequence-specific minimizers, we propose the concept of polar sets, a new approach to constructing sequence-specific minimizers with the ability to directly optimize the specific density of the resulting sequence sketch. We also propose simple and efficient heuristics for constructing (layered) polar sets, and demonstrate, via experiments on the human reference genome, the superior performance of minimizers constructed by our heuristics. While there are still concerns about practical utility, we believe the polar set framework will be a valuable asset in the design and analysis of efficient sequence sketches.
Funding

This work has been supported in part by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4554 to C.K., by the US National Institutes of Health (R01GM122935), and the US National Science Foundation (DBI-1937540). This work was partially funded by The Shurl and Kay Curci Foundation. This project is funded, in part, under a grant (#4100070287) with the Pennsylvania Department of Health. The Department specifically disclaims responsibility for any analyses, interpretations or conclusions.

Conflict of interests: C.K. is a co-founder of Ocean Genomics, Inc. G.M. is V.P. of software development at Ocean Genomics, Inc.

References

Almutairy, M. and Torng, E. (2018). Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches. PLOS ONE, 13(2), e0189960.

Blackburn, S. R. (2015). Non-overlapping codes. IEEE Transactions on Information Theory, 61(9), 4890–4894.

Chikhi, R., Limasset, A., and Medvedev, P. (2015). Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 32(12), i201–i208.

DeBlasio, D., Gbosibo, F., Kingsford, C., and Marçais, G. (2019). Practical universal k-mer sets for minimizer schemes. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB '19, pages 167–176, New York, NY, USA. ACM.

Deorowicz, S., Kokot, M., Grabowski, S., and Debudaj-Grabysz, A. (2015). KMC 2: Fast and resource-frugal k-mer counting. Bioinformatics, 31(10), 1569–1576.
Edgar, R. C. (2020). Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. bioRxiv.

Ekim, B., Berger, B., and Orenstein, Y. (2020). A randomized parallel algorithm for efficiently finding near-optimal universal hitting sets. bioRxiv: 2020.01.17.910513.

Erbert, M., Rechner, S., and Müller-Hannemann, M. (2017). Gerbil: a fast and memory-efficient k-mer counter with GPU-support. Algorithms for Molecular Biology, 12(1), 9.

Frith, M. C., Noé, L., and Kucherov, G. (2020). Minimally-overlapping words for sequence similarity search. bioRxiv.

Jain, C., Rhie, A., Hansen, N., Koren, S., and Phillippy, A. M. (2020a). A long read mapping method for highly repetitive reference sequences. bioRxiv, page 2020.11.01.363887.

Jain, C., Rhie, A., Zhang, H., Chu, C., Walenz, B. P., Koren, S., and Phillippy, A. M. (2020b). Weighted minimizer sampling improves long read mapping. Bioinformatics, 36(Supplement 1), i111–i118.

Kempa, D. and Kociumaka, T. (2019). String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 756–767.

Khiste, N. and Ilie, L. (2015). E-MEM: efficient computation of maximal exact matches for very large genomes. Bioinformatics, 31(4), 509–514.

Levenshtein, V. I. (1970). Maximum number of words in codes without overlaps. Problemy Peredachi Informatsii, 6(4), 88–90.

Li, H. and Birol, I. (2018). Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094–3100.

Liu, Y., Zhang, L. Y., and Li, J. (2019). Fast detection of maximal exact matches via fixed sampling of query k-mers and Bloom filtering of index k-mers. Bioinformatics, 35(22), 4560–4567.

Marçais, G., Pellow, D., Bork, D., Orenstein, Y., Shamir, R., and Kingsford, C. (2017). Improving the performance of minimizers and winnowing schemes. Bioinformatics, 33(14), i110–i117.

Marçais, G., DeBlasio, D., and Kingsford, C. (2018). Asymptotically optimal minimizers schemes. Bioinformatics, 34(13), i13–i22.

Marçais, G., Solomon, B., Patro, R., and Kingsford, C. (2019). Sketching and sublinear data structures in genomics. Annual Review of Biomedical Data Science, 2(1), 93–118.

Mykkeltveit, J. (1972). A proof of Golomb's conjecture for the de Bruijn graph. Journal of Combinatorial Theory, Series B, 13(1), 40–45.

Nyström-Persson, J. T., Keeble-Gagnère, G., and Zawad, N. (2020). Compact and evenly distributed k-mer binning for genomic sequences. bioRxiv.

Orenstein, Y., Pellow, D., Marçais, G., Shamir, R., and Kingsford, C. (2016). Compact universal k-mer hitting sets. In Algorithms in Bioinformatics, Lecture Notes in Computer Science, pages 257–268. Springer, Cham.

Pellow, D., Filippova, D., and Kingsford, C. (2017). Improving Bloom filter performance on sequence data using k-mer Bloom filters. Journal of Computational Biology, 24(6), 547–557.

Roberts, M., Hunt, B. R., Yorke, J. A., Bolanos, R. A., and Delcher, A. L. (2004a). A preprocessor for shotgun assembly of large genomes. Journal of Computational Biology, 11(4), 734–752.

Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M., and Yorke, J. A. (2004b). Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18), 3363–3369.

Schleimer, S., Wilkerson, D. S., and Aiken, A. (2003). Winnowing: Local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD '03, pages 76–85. ACM.

Ye, C., Ma, Z. S., Cannon, C. H., Pop, M., and Yu, D. W. (2012). Exploiting sparseness in de novo genome assembly. BMC Bioinformatics, 13, S1.

Zheng, H., Kingsford, C., and Marçais, G. (2020a). Improved design and analysis of practical minimizers. Bioinformatics, 36(Supplement 1), i119–i127.

Zheng, H., Kingsford, C., and Marçais, G. (2020b). Lower density selection schemes via small universal hitting sets with short remaining path length. arXiv preprint arXiv:2001.06550.

Supplementary Materials

S1 A Technical Lemma on k-mer Repetition

Here we prove a technical lemma on repetitive occurrences of k-mers. Similar versions can be found in Chikhi et al. (2015). Recall σ is the size of the alphabet.

Lemma S7. Given a random sequence and a pair of locations i < j, the probability that the k-mer starting at i equals the k-mer starting at j is exactly σ^-k.

Proof. If j − i ≥ k, the two k-mers do not share bases, so, as they are both random k-mers independent of each other, the probability is σ^-k. Otherwise, the two k-mers intersect. Let d = j − i, and let m_i denote the k-mer starting at location i. Let s denote the substring from the start of m_i to the end of m_j, of length k + d (equivalently, the union of m_i and m_j). If m_i = m_j, the p-th character of m_i equals the p-th character of m_j, meaning s_p = s_{p+d} for all 0 ≤ p < k. This means s is a repeating sequence of period d, so s is uniquely determined by its first d characters, and there are σ^d possible configurations of s. The probability that a random s satisfies m_i = m_j is then σ^d/σ^(k+d) = σ^-k.
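A quick simulation is consistent with Lemma S7 in the overlapping case (a sanity check only; the function name and parameters are ours):

```python
import random

def kmer_match_prob(sigma, k, d, trials=200000, seed=0):
    """Empirical probability that the k-mers starting d bases apart in a
    random sequence over a sigma-letter alphabet are equal; the two
    k-mers overlap whenever d < k."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = [rng.randrange(sigma) for _ in range(k + d)]
        if s[:k] == s[d:d + k]:
            hits += 1
    return hits / trials
```

For example, with sigma = 2 and k = 4, the estimate stays close to 2^-4 = 0.0625 whether d < k (overlapping) or d ≥ k (disjoint), as the lemma predicts.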
S2 Universal Hitting Sets and Related Analyses

Universal hitting sets have been an important component in constructing practical minimizers. In this section, we provide a more formal and technical discussion of universal hitting sets. In Section S2.1, we formally define UHSes and discuss why existing heuristics to construct them are not adequate for sequence-specific minimizers. In Sections S2.2 and S2.3, we discuss the two existing methods to analyze minimizers compatible with a UHS, and show that both approaches have issues that make them unfit for our goal. In Section S2.4 we discuss how UHSes can in fact be treated as special cases of polar sets, which may inspire new developments in this line of research.

S2.1 Definitions and Inelasticity of UHS

Definition S10 (Universal Hitting Sets). Let U be a set of k-mers. If U intersects every window of w consecutive k-mers, it is a UHS over k-mers with path length w and relative size |U|/σ^k.

A decycling set is a set of k-mers that intersects every sufficiently long string. Any universal hitting set must be a decycling set, so the lower bound on the size of decycling sets applies to all universal hitting sets.

Lemma S8 (Minimal Decycling Sets). Any UHS over k-mers with finite path length has relative size Ω(1/k).

With a universal hitting set, any compatible minimizer is guaranteed to select only k-mers within the UHS on any sequence. Currently, the most popular approach for constructing efficient minimizers is to build a compact universal hitting set, then construct a compatible minimizer. These universal hitting sets are usually constructed by expanding a minimal decycling set (MDS). As we have shown before (Zheng et al., 2020b), the Mykkeltveit set (Mykkeltveit, 1972), the MDS that is predominantly used as the starting point, already covers all windows of length O(k^3). Empirically, with larger values of w only a few k-mers need to be added to satisfy the universal hitting condition.
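For a concrete sequence, the hitting condition with path length w can be checked directly. A minimal sketch (our own illustration; checking one sequence is of course only a necessary condition for a true UHS, which must hit every possible sequence):

```python
def hits_every_window(U, seq, w, k):
    """Check that the k-mer set U intersects every window of w
    consecutive k-mers in seq (necessary condition for U being a
    UHS with path length w)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return all(
        any(m in U for m in kmers[i:i + w])
        for i in range(len(kmers) - w + 1)
    )

# A UHS must contain the homopolymer A...A to cover poly-A runs:
# hits_every_window({'AA'}, 'AAAAAA', w=3, k=2) -> True
# hits_every_window({'AC'}, 'AAAAAA', w=3, k=2) -> False
```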
As a result, UHSes constructed for different references closely resemble each other, and the compatible minimizers do not specialize well.

A related concern about using UHSes on specific sequences is the handling of repetitive k-mers. As we have discussed, repetitive k-mers are prevalent in the human reference genome. Any universal hitting set contains homopolymers like AAA···A, as it is required to cover a sequence of all As; this argument also extends to other repetitive k-mers. Such homopolymers, or repetitive k-mers, would then be preferred when using compatible minimizers for sequence sketching. This problem of prioritizing repetitive k-mers is also present in fixed interval sampling. Meanwhile, existing literature (Li and Birol, 2018; Jain et al., 2020b) suggests it is in fact beneficial not to select these k-mers for read mapping, and proposes different remedies to this issue. Our proposed methods also have the effect of avoiding repetitive k-mers, as these k-mers are unlikely to pass the filtering step.

S2.2 Analysis via Density Upper Bound

There are two existing ways to analyze the density of compatible minimizers. The first is via the following lemma, as mentioned in the main text:

Lemma S9. If U is a UHS over k-mers, any compatible minimizer has density at most |U|/σ^k.

This lemma is universally applicable and does not depend on the ordering within U. However, it is an upper bound that becomes non-informative with w > 2k and sufficiently large k.
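To see what this bound has to beat, the density of a random minimizer, approximately 2/(w+1), can be estimated by simulation. The sketch below is our own illustration (a random order is emulated by lazily assigning each distinct k-mer a random rank):

```python
import random

def random_minimizer_density(w=10, k=10, n=100_000, sigma=4, seed=1):
    """Estimate the density of a random (w,k)-minimizer on a random
    sequence: fraction of k-mer positions selected as window minima."""
    rng = random.Random(seed)
    seq = [rng.randrange(sigma) for _ in range(n)]
    ranks = {}  # one lazily drawn random rank per distinct k-mer

    def rank(i):
        m = tuple(seq[i:i + k])
        if m not in ranks:
            ranks[m] = rng.random()
        return ranks[m]

    picked = set()
    n_kmers = n - k + 1
    for i in range(n_kmers - w + 1):
        picked.add(min(range(i, i + w), key=rank))  # leftmost minimal k-mer
    return len(picked) / n_kmers

# With w = 10 the expected density is about 2/(w+1) = 0.1818...
```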
Because any universal hitting set is at least as large as a minimal decycling set (Lemma S8), and a random (w,k)-minimizer achieves density of approximately 2/(w+1), Lemma S9 at best tells us the compatible minimizer is no worse than a random one.

S2.3 Analysis via Probability of Single UHS Contexts

There is a second approach to analyzing minimizers compatible with a universal hitting set (Marçais et al., 2017). The key lemma reads as follows (slightly rephrased):

Lemma. If U is a UHS over k-mers, let SP(U) be the probability that a context contains only one element of U. Under certain assumptions, the expected density of a random minimizer compatible with U is 2(1 − SP(U))/(w + 1).

We now show this lemma depends on an assumption that is highly sensitive to the structure of U. We start with some notation, slightly different from the original paper. Fix a context, and let m_i denote the i-th k-mer in the context. Let z_i = 1(m_i ∈ U), let H denote the event that the context is charged, and let Z = Σ_{i=0}^{w} z_i. Let C(n,k) denote the binomial coefficient. The proof involves the following equation (we only list the first term; there are four analogous terms):

P(H | Z = j) = [C(w−1, j−1) / C(w+1, j)] · P(H | Z = j, z_0 = 1, z_w = 0) + ···

which involves a counting argument: given Z = Σ_{i=0}^{w} z_i = j, there are C(w+1, j) different configurations of z, and C(w−1, j−1) of them satisfy z_0 = 1 and z_w = 0. However, by invoking this counting argument, it is implicitly assumed that every configuration satisfying Σ z_i = j happens with the same probability, as stated (again, we only keep the terms with z_0 = 1 and z_w = 0 and hide the rest):

Assumption. Let P(z) be the probability of generating a random context and observing z_i = 1(m_i ∈ U). If Σ z_i = Σ z′_i, then P(z) = P(z′). If this holds, we also have P(z | Z = j) = 1 / C(w+1, j).
We now recover the statement as follows:

P(H | Z = j, z_0 = 1, z_w = 0)
  = Σ_{z: Z = j, z_0 = 1, z_w = 0} P(H | z) · P(z | Z = j, z_0 = 1, z_w = 0)
  = Σ_{z: Z = j, z_0 = 1, z_w = 0} P(H | z) / C(w−1, j−1)

P(H | Z = j)
  = Σ_{z: Z = j} P(H | z) · P(z | Z = j)
  = Σ_{z: Z = j, z_0 = 1, z_w = 0} P(H | z) · P(z | Z = j) + ···
  = Σ_{z: Z = j, z_0 = 1, z_w = 0} P(H | z) / C(w+1, j) + ···
  = [C(w−1, j−1) / C(w+1, j)] · P(H | Z = j, z_0 = 1, z_w = 0) + ···

The assumption holds in expectation if the UHS itself is a random subset of Σ^k, which is not the case, as the set also has to satisfy the UHS condition. For a general set U, the probability that a k-mer is in U is highly dependent on whether the preceding, overlapping k-mers are in U, and the assumption is likely invalid in most scenarios.

Finally, universal hitting sets may be constructed in a specific way to enable better analysis of compatible minimizers, as seen in Zheng et al. (2020a). We do not discuss these constructions, as they do not apply to other universal hitting sets.

S2.4 UHS as Improper Polar Sets

The alternative formula for link energy, described in Section 2.3.4, allows us to define the link energy of any subset of k-mers, not just those satisfying the polar set condition. The main theorem for polar sets still holds, but only the upper bound part. Interestingly, if we plug in a universal hitting set, we get A_cov = n, A_ele = |U|, A_seg = 1, and a link energy of 2n/(w+1) − |U| − 1, where n is the number of k-mers in the reference sequence and |U| is the total number of times k-mers in the UHS appear in the reference sequence.
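The counting factor itself is correct and can be verified by brute-force enumeration (a standalone check we add for illustration):

```python
from itertools import product
from math import comb

def fraction_z0_one_zw_zero(w, j):
    """Among 0/1 vectors z of length w+1 with sum j, the fraction
    with z_0 = 1 and z_w = 0, computed by exhaustive enumeration."""
    configs = [z for z in product((0, 1), repeat=w + 1) if sum(z) == j]
    good = [z for z in configs if z[0] == 1 and z[-1] == 0]
    return len(good) / len(configs)

# The counting argument claims this equals C(w-1, j-1) / C(w+1, j);
# e.g. w=4, j=2 gives 3/10, matching C(3,1)/C(5,2).
```

What is not verified by counting, and what the argument implicitly assumes, is that each configuration is equally likely under a random context.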
Plugging this into the main polar set theorem, we recover the specific density upper bound |U|/n for universal hitting sets, up to an error of D(S)/n. In this sense, universal hitting sets can be seen as a specific and extreme case of improper polar sets.

S3 NP-Completeness of Optimal Polar Set

In this section, we show a reduction from the problem of maximum independent set to the problem of optimal polar set, with an alphabet of size 3. Let G = (V,E) be the instance of maximum independent set, and without loss of generality, let |V| = 2^d. We use Σ = {X, 0, 1} as the alphabet, and for the polar set instance, we let w = 2d + 1, k = d and s = 0. This means we want to find a subset of d-mers that form many links exactly 2d + 1 bases apart, while no two d-mers in the polar set may be fewer than 2d + 1 bases from each other. With s = 0, link energy is equivalent to the number of links up to a scaling factor, so we are maximizing the number of links that can be formed. We now construct the query string for the polar set instance, which we divide into three sections.

Disqualification Gadget. Given an arbitrary d-mer z ∈ Σ^d, we let the disqualification gadget be the following string:

DQ(z) = X^{2d+1} z X^d z X^{2d+1}

In the presence of DQ(z), z cannot appear in the polar set, because it appears twice exactly 2d bases apart within the gadget. The X^{2d+1} sections on both ends of the gadget prevent d-mers within the gadget from forming links with adjacent gadgets or sections, as X^d is not in the polar set.

Disqualification Section. We append a disqualification gadget to the query string for every d-mer containing at least one X (there are at most 3^d = |V|^{log_2 3} of them, which is polynomial in |V|); d-mers containing only 0 and 1 are excluded.

Vertex Section. For each vertex v in G, let a be its binary representation. We add X^{2d+1} a X^{d+1} a X^{2d+1} to the query string.

Edge Section. For each edge (u,v) in G, let a, b be the binary representations of its two endpoints. We add X^{2d+1} a X^d b X^{2d+1} to the query string.
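The construction above can be written down directly; the helper below is our own illustrative sketch (function names are ours, not the paper's), building the three sections for a small graph:

```python
from itertools import product

def build_query_string(vertices, edges, d):
    """Build the reduction's query string over Sigma = {X, 0, 1}.
    `vertices` are distinct d-bit strings over {0,1}; `edges` are
    pairs of them. Illustrative sketch, not the paper's code."""
    pad = 'X' * (2 * d + 1)

    def dq(z):            # disqualification: z twice, 2d bases apart
        return pad + z + 'X' * d + z + pad

    def vertex(a):        # vertex gadget: a twice, 2d+1 bases apart (one link)
        return pad + a + 'X' * (d + 1) + a + pad

    def edge(a, b):       # edge gadget: a and b only 2d bases apart
        return pad + a + 'X' * d + b + pad

    all_dmers = (''.join(p) for p in product('X01', repeat=d))
    disq = ''.join(dq(z) for z in all_dmers if 'X' in z)
    vert = ''.join(vertex(a) for a in vertices)
    edg = ''.join(edge(a, b) for a, b in edges)
    return disq + vert + edg
```

For d = 2, the vertex gadget for a = 00 is XXXXX00XXX00XXXXX: the two copies of 00 start 2d + 1 = 5 bases apart, forming exactly one link if 00 is selected.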
The final query string is formed by concatenating the three sections.

Theorem S2. Maximum independent set can be solved by finding the optimal polar set of the query string above.

Proof. We claim that any polar set of the query string corresponds to an independent set V′ of G with |V′| links. All d-mers in the polar set represent vertices of G, as other d-mers (those containing X) cannot appear due to the disqualification section. For each d-mer in the polar set, we get one link from the vertex section of the query string. If (u,v) ∈ E, the two d-mers representing u and v cannot both be selected into the polar set, because in the edge section these two d-mers are exactly 2d bases apart, violating the polar set condition. Conversely, every independent set V′ of G can be represented by a polar set with |V′| links in total, by the same argument. We conclude that the optimal polar set of the query string corresponds to a maximum independent set of G, which proves the statement.

This reduction also implies hardness of approximately solving optimal polar sets.

S4 On Non-Sequence-Specific Polar Sets

For the sake of simplicity, in this section we only discuss polar sets with s = 0; the discussion for s > 0 is highly similar. As we have discussed, the density of a minimizer is the expected specific density over a random sequence. Equivalently, it equals the specific density on a de Bruijn sequence of order at least w + k.
Therefore, one may construct polar sets on a de Bruijn sequence of sufficient order to build non-sequence-specific minimizers. However, this is impossible with long windows:

Lemma S10. No non-trivial polar set exists when w > k and S is the de Bruijn sequence of order w + k.

Proof. We show that no k-mer can be in the set. For every k-mer m, the sequence mm occurs within S, because S is a de Bruijn sequence of order at least 2k. Picking m violates the polar set condition because it appears twice, k < w bases apart, in S.

Polar sets do exist on de Bruijn sequences of order w + k when w ≤ k. With w = k, these polar sets become sets of non-overlapping k-mers (Levenshtein, 1970), that is, sets of k-mers where no proper prefix of a k-mer equals a proper suffix of another k-mer. The problem of finding a large set of non-overlapping k-mers is hard in general, although constructive algorithms exist (Blackburn, 2015) for constant-factor approximation. With w < k we obtain minimally overlapping k-mers, a concept that has also been studied in other contexts (Frith et al., 2020). We believe the concept of non-sequence-specific polar sets is of both practical and theoretical interest.

S5 Supplementary Experiments and Figures

S5.1 Density Factor of Layered Polar Sets by Round

To show that our proposed layered anchor set heuristic is useful, in Figure 5 we plot the density factor after each round of optimization on the human reference genome hg38. All algorithms are run for a total of 7 rounds, the last two being monotonic rounds. We choose 7 rounds to ensure the resulting sets are not too complicated and can be computed in a reasonable amount of time. With more rounds, many of the results can be further improved.
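The density factor reported in these experiments is the specific density (selected positions per k-mer) scaled by w + 1. As an illustration (our own sketch; `order` is any function mapping a k-mer to a rank, with ties broken by position), it can be computed directly:

```python
def specific_density_factor(seq, w, k, order):
    """Specific density factor of a minimizer on `seq`: number of
    selected positions per k-mer, scaled by (w + 1)."""
    n_kmers = len(seq) - k + 1
    picked = set()
    for i in range(n_kmers - w + 1):
        # leftmost k-mer of minimal rank in the window [i, i + w)
        picked.add(min(range(i, i + w), key=lambda j: (order(seq[j:j + k]), j)))
    return len(picked) / n_kmers * (w + 1)

# e.g. a lexicographic order on a toy periodic sequence:
# specific_density_factor('ACGTACGTACGT', w=4, k=3, order=lambda m: m)
```

A perfect minimizer (one selected position per w k-mers) would achieve a density factor close to (w + 1)/w.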
S5.2 Viability of Sequence-Specific Minimizers on Non-target Sequences

To validate that optimizing sequence-specific density does not come at the cost of higher (non-sequence-specific) density, we generate the sequence-specific minimizers for the hg38 reference genome, then apply these minimizers to a random sequence. Figure 6 shows the results. We expect these minimizers to perform close to random minimizers when σ^k ≫ N, where N is the length of the reference sequence: in that regime, most k-mers in a random sequence are not seen in the reference sequence, so the optimized sequence-specific minimizers behave like random minimizers most of the time. The performance of the Miniception is almost identical to that on hg38, and is not shown in this plot. The layered polar set method is also arguably more robust at lower values of k, as its density stays close to that of a random minimizer.

Figure 5: Density factor of layered anchor sets after each round of the optimization, corresponding to the experiments shown in Figure 4B.
Figure 6: Performance of sequence-specific minimizers on random sequences (optimized on hg38) with w = 10 (left) and w = 100 (right). This is different from Figure 8: here the specific density is measured on an unrelated random sequence.

S5.3 Experiments on Human Chromosome 1

To show the effect of reference sequence length on the performance of sequence-specific minimizers, in Figure 7 we show the performance plot when we build sequence-specific minimizers for chr1 only. The human chromosome 1 sequence is around 10% of the whole hg38 sequence, and consistent with our theory, the time and memory spent to run these experiments on chr1 are around 10% of those for hg38.

Figure 7: Performance of sequence-specific minimizers, optimized and tested on human chromosome 1 with w = 10 (left) and w = 100 (right).
S5.4 Building Sequence-Specific Minimizers on Random Sequences

To further show that the human reference genome is highly repetitive and that constructing efficient sequence-specific minimizers is hard in such a setting, we run the algorithms to generate sequence-specific minimizers on a random sequence of length 230,000,000, similar to that of chromosome 1. Figure 8 shows the performance of the layered polar sets and the fixed interval sampling method. Compared with Figure 7, we observe that it is much easier to build efficient minimizers on a random sequence, and to match the theoretical lower bound, even though the reference sequences have similar length.

Figure 8: Performance of sequence-specific minimizers, optimized and tested on a 230,000,000-long random sequence with w = 10 (left) and w = 100 (right). This is different from Figure 6: here the specific density is measured on the same sequence the minimizers were optimized on.