Augmented Sparsifiers for Generalized Hypergraph Cuts with Applications to Decomposable Submodular Function Minimization
Austin R. Benson, Jon Kleinberg, Nate Veldt
July 16, 2020

In recent years, hypergraph generalizations of many graph cut problems have been introduced and analyzed as a way to better explore and understand complex systems and datasets characterized by multiway relationships. Recent work has made use of a generalized hypergraph cut function which, for a hypergraph $\mathcal{H} = (V,E)$, can be defined by associating each hyperedge $e \in E$ with a splitting function $\mathbf{w}_e$ that assigns a penalty to each way of separating the nodes of $e$. When each $\mathbf{w}_e$ is a submodular cardinality-based splitting function, meaning that $\mathbf{w}_e(S) = g(|S|)$ for some concave function $g$, previous work has shown that a generalized hypergraph cut problem can be reduced to a directed graph cut problem on an augmented node set. However, existing reduction procedures often result in a dense graph, even when the hypergraph is sparse, which leads to slow runtimes for algorithms that run on the reduced graph. We introduce a new framework of sparsifying hypergraph-to-graph reductions, where a hypergraph cut defined by submodular cardinality-based splitting functions is $(1+\varepsilon)$-approximated by a cut on a directed graph. Our techniques are based on approximating concave functions using piecewise linear curves. For $\varepsilon > 0$ we need at most $O(\varepsilon^{-1}|e| \log |e|)$ edges to reduce any hyperedge $e$, which leads to faster runtimes for approximating generalized hypergraph $s$-$t$ cut problems. For the machine learning heuristic of a clique splitting function, our approach requires only $O(|e|\,\varepsilon^{-1/2} \log \log \frac{1}{\varepsilon})$ edges. This sparsification leads to faster approximate min $s$-$t$ graph cut algorithms for certain classes of co-occurrence graphs. Finally, we apply our sparsification techniques to develop approximation algorithms for minimizing sums of cardinality-based submodular functions.

Hypergraphs are a generalization of graphs in which nodes are organized into multiway relationships called hyperedges. Given a hypergraph H = (V, E) and a set of nodes S ⊆ V, a hyperedge e ∈ E is said to be cut by S if both S and S̄ = V \ S contain at least one node from e. Developing efficient algorithms for cut problems in hypergraphs is an active area of research in theoretical computer science [17-19, 25, 40], and has been applied to problems in VLSI layout [4, 31, 39], sparse matrix partitioning [2, 8], and machine learning [51, 53, 76]. Here, we consider recently introduced generalized hypergraph cut functions [51, 53, 75, 80], which assign different penalties to cut hyperedges based on how the nodes of a hyperedge are split across the bipartition induced by S. To define a generalized hypergraph cut function, each hyperedge e ∈ E is first associated with a splitting function w_e : 2^e → R_+ that maps each node configuration of e (defined by the subset A = S ∩ e) to a nonnegative penalty. In order to mirror edge cut penalties in graphs, splitting functions are typically assumed to be symmetric (w_e(A) = w_e(e \ A)) and to penalize only cut hyperedges (i.e., w_e(∅) = 0). The generalized hypergraph cut function for a set S ⊆ V is then given by

    cut_H(S) = Σ_{e ∈ E} w_e(S ∩ e).    (1)
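To make the objects in (1) concrete, the following short Python sketch (ours, not from the paper) evaluates a generalized hypergraph cut when every splitting function is symmetric and cardinality-based, i.e., w_e(A) depends only on min{|A|, |e \ A|} through a concave function. The example hyperedges, functions, and values are illustrative only.

```python
# A minimal sketch (not from the paper) of evaluating the generalized
# hypergraph cut in Eq. (1) when every splitting function is symmetric
# and cardinality-based: w_e(A) = g(min(|A|, |e| - |A|)) for a concave g.

def generalized_cut(hyperedges, splitting_fns, S):
    """hyperedges: list of node sets; splitting_fns: list of concave functions
    g_e defined on {0, ..., floor(|e|/2)}; S: set of nodes."""
    total = 0.0
    for e, g in zip(hyperedges, splitting_fns):
        a = len(e & S)                      # nodes of e on the S side
        total += g(min(a, len(e) - a))      # symmetric, cardinality-based penalty
    return total

# Example: an all-or-nothing penalty and a linear (star-expansion) penalty.
edges = [{1, 2, 3, 4}, {3, 4, 5}]
fns = [lambda i: float(i > 0),              # all-or-nothing
       lambda i: float(i)]                  # linear penalty
print(generalized_cut(edges, fns, {1, 3}))  # 1 + 1 = 2.0
```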
The standard hypergraph cut function is all-or-nothing, meaning it assigns the same penalty to a cut hyperedge regardless of how its nodes are separated. In the splitting function terminology, this means that w_e(A) = 0 if A ∈ {∅, e}, and w_e(A) = w_e otherwise, where w_e is a scalar hyperedge weight. One particularly relevant class of splitting functions is the submodular functions, which for all A, B ⊆ e satisfy w_e(A) + w_e(B) ≥ w_e(A ∩ B) + w_e(A ∪ B). When all hyperedge splitting functions are submodular, solving generalized hypergraph cut problems is closely related to minimizing a decomposable submodular function [22, 23, 43, 52, 62, 72], which in turn is closely related to energy minimization problems often encountered in computer vision [26, 41, 42]. The standard graph cut function is another well-known submodular special case of (1).

One of the most common techniques for solving hypergraph cut problems is to reduce the hypergraph to a graph sharing similar (or in some cases identical) cut properties. Arguably the most widely used reduction technique is clique expansion, which replaces each hyperedge with a (possibly weighted) clique [12, 31, 51, 81, 83]. In the unweighted case this corresponds to applying a splitting function of the form w_e(A) = |A| · |e \ A|. Previous work has also explored other classes of submodular hypergraph cut functions that can be modeled as a graph cut problem on a potentially augmented node set [26, 41, 42, 46, 75]. This research primarily focuses on proving when such a reduction is possible, regardless of the number of edges and auxiliary nodes needed to realize the reduction. However, because hyperedges can be very large and splitting functions may be very general and intricate, many of these techniques lead to large and dense graphs. Therefore, the reduction strategy significantly affects the runtime and practicality of algorithms that run on the reduced graph. This leads to several natural questions. Are the graph sizes resulting from existing techniques inherently necessary for modeling hypergraph cuts? Given a class of functions that are known to be graph reducible, can one determine more efficient or even the most efficient reduction techniques? Finally, is it possible to obtain more efficient reductions and faster downstream algorithms if it is sufficient to only approximately model cut penalties?

To answer these questions, we present a novel framework for sparsifying hypergraph-to-graph reductions with provable guarantees on preserving cut properties. Our framework brings together concepts and techniques from several different theoretical domains, including algorithms for solving generalized hypergraph cut problems [51, 53, 75, 80], standard graph sparsification techniques [10, 68, 70], and tools for approximating functions with piecewise linear curves [57, 58]. We present sparsification techniques for a large and natural class of submodular splitting functions that are cardinality-based, meaning that w_e(A) = w_e(B) whenever |A| = |B|. These are known to always be graph reducible, and are particularly natural for several downstream applications [75]. Our approach leads to graph reductions that are significantly sparser than previous approaches, and we show that our method is in fact optimally sparse under a certain type of reduction strategy.
Our sparsification framework can be directly used to develop faster algorithms for approximately solving hypergraph s-t cut problems [75], and to improve runtimes for a large class of cardinality-based decomposable submodular minimization problems [36, 41, 43, 72]. We also show how our techniques enable us to develop efficient sparsifiers for graphs constructed from co-occurrence data.

Our framework and results share numerous connections with existing work on graph sparsification, which we review here. Let G = (V, E) be a graph with a cut function cut_G, which can be viewed as a very restricted case of the generalized hypergraph cut function in Eq. (1). An ε-cut sparsifier for G is a sparse weighted and undirected graph H = (V, F) with cut function cut_H such that

    (1 − ε) · cut_G(S) ≤ cut_H(S) ≤ (1 + ε) · cut_G(S)    (2)

for every subset S ⊆ V. This definition was introduced by Benczúr and Karger [11], who showed how to obtain a sparsifier with O(n log n / ε²) edges for any graph in O(m log³ n) time for an n-node, m-edge graph. The more general notion of spectral sparsification, which approximately preserves the Laplacian quadratic form of a graph rather than just the cut function, was later introduced by Spielman and Teng [69]. The best cut and spectral sparsifiers have O(n/ε²) edges, which is known to be asymptotically optimal for both spectral and cut sparsifiers [5, 10]. Although studied much less extensively, analogous definitions of cut [19, 40] and spectral [68] sparsifiers for hypergraphs have also been developed. However, these apply exclusively to the all-or-nothing cut penalty, and do not preserve generalized cut functions of the form shown in (1). Bansal et al. [9] also considered a weaker notion of graph and hypergraph sparsification, involving additive approximation terms, but in the present work we only consider multiplicative approximations.

In this paper, we introduce an alternative notion of an augmented cut sparsifier. We present our results in the context of hypergraph-to-graph reductions, though our framework also provides a new notion of augmented sparsifiers for graphs. Let H = (V, E) be a hypergraph with a generalized cut function cut_H, and let Ĝ = (V ∪ A, Ê) be a directed graph on an augmented node set V ∪ A. The graph is equipped with an augmented cut function defined for any S ⊆ V by

    cut_Ĝ(S) = min_{T ⊆ A} dircut_Ĝ(S ∪ T),    (3)

where dircut_Ĝ is the standard directed cut function on Ĝ. We say that Ĝ is an ε-augmented cut sparsifier for H if it is sparse and satisfies

    cut_H(S) ≤ cut_Ĝ(S) ≤ (1 + ε) · cut_H(S).    (4)

The minimization involved in (3) is especially natural when the goal is to approximate a minimum cut or minimum s-t cut in H. If we solve the corresponding cut problem in Ĝ, nodes from the auxiliary node set A will be automatically arranged in a way that yields the minimum directed cut penalty, as required in (3). If Ŝ* is the minimum cut in Ĝ, then S* = V ∩ Ŝ* will be a (1 + ε)-approximate minimum cut in H. Even when solving a minimum cut problem is not the goal, our sparsifiers will be designed in such a way that the augmented cut function (3) is easy to evaluate.

Unlike the standard graph sparsification problem, in some cases it may in fact be impossible to find any directed graph Ĝ satisfying (4), independent of the graph's density. In recent work we showed that hypergraphs with non-submodular splitting functions are never graph reducible [75]. Živný et al. [84] showed that even in the case of four-node hyperedges, there exist submodular splitting functions (albeit asymmetric splitting functions) that are not representable by graph cuts.
Nevertheless, there are several special cases in which graph reduction is possible [26, 41, 42].

Augmented Sparsifiers for Cardinality-Based Hypergraph Cuts. We specifically consider the class of submodular splitting functions that are cardinality-based, meaning they satisfy w_e(A) = w_e(B) whenever A, B ⊆ e satisfy |A| = |B|. These are known to be graph reducible [41, 75], though existing techniques will reduce a hypergraph H = (V, E) to a graph with O(|V| + Σ_{e∈E} |e|) nodes and O(Σ_{e∈E} |e|²) edges. We prove the following sparse reduction result.

Theorem 1.1. Let ε > 0 and let H = (V, E) be a hypergraph in which every hyperedge has a submodular cardinality-based splitting function. Then there exists an ε-augmented cut sparsifier Ĝ for H with O(|V| + Σ_{e∈E} ε^{-1} log |e|) nodes and O(Σ_{e∈E} ε^{-1} |e| log |e|) edges.

For certain types of splitting functions (e.g., the one corresponding to a clique expansion), we show that our reductions are even sparser.

Augmented Sparsifiers for Graphs. Another relevant class of augmented sparsifiers to consider is the setting where H is simply a graph. In this case, if A is empty and all edges are undirected, condition (4) reduces to the standard definition of a cut sparsifier. A natural question is whether there exist cases where allowing auxiliary nodes and directed edges leads to improved sparsifiers. We show that the answer is yes in the case of dense graphs constructed from co-occurrence data.

Augmented Spectral Sparsifiers. Just as spectral sparsifiers generalize cut sparsifiers in the standard graph setting, one can define an analogous notion of an augmented spectral sparsifier for hypergraph reductions. This can be accomplished using existing hypergraph generalizations of the Laplacian operator [16, 53, 55, 80]. However, although developing augmented spectral sparsifiers constitutes an interesting open direction for future research, it is unclear whether the techniques we develop here can be used or adapted to spectrally approximate generalized hypergraph cut functions. We include further discussion on hypergraph Laplacians and spectral sparsifiers in Section 8, and pose questions for future work. Our primary focus in this manuscript is to develop techniques for augmented cut sparsifiers.

Graph reduction techniques work by replacing a hyperedge with a small graph gadget modeling the same cut properties as the hyperedge splitting function. The simplest example of a graph reducible function is the quadratic splitting function, which we also refer to as the clique splitting function:

    w_e(A) = |A| · |e \ A|.    (5)

This function can be modeled by replacing a hyperedge with a clique (Figure 1b). Another function that can be modeled by a gadget is the linear penalty, which can be modeled by a star gadget [83]:

    w_e(A) = min{|A|, |e \ A|}.    (6)

A star gadget (Figure 1a) contains an auxiliary node v_e for each e ∈ E, which is attached to each v ∈ e with an undirected edge. In order to model the broader class of submodular cardinality-based splitting functions, we previously introduced the cardinality-based gadget (CB-gadget) [75] (Figure 1c). This gadget is parameterized by positive scalars a and b, and includes two auxiliary nodes e' and e''. For each node v ∈ e, there is a directed edge from v to e' and a directed edge from e'' to v, both of weight a. Lastly, there is a directed edge from e' to e'' of weight a · b. This CB-gadget corresponds to the following splitting function:

    w_e(A) = a · min{|A|, |e \ A|, b}.    (7)

Every submodular, cardinality-based (SCB) splitting function can be modeled by a combination of CB-gadgets with different edge weights [75]. A different reduction strategy for minimizing submodular energy functions with cardinality-based penalties was also previously developed by Kohli et al. [41]. Both techniques require up to O(k²) directed edges for a k-node hyperedge.
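To ground the gadget construction, here is a short Python sketch (ours, not the paper's code) that materializes one CB-gadget with parameters (a, b) as a weighted directed edge list and evaluates the splitting penalty (7) it induces. The auxiliary-node labels are our own convention.

```python
# Sketch (ours) of a single CB-gadget with parameters (a, b) for hyperedge e:
# each v in e gets directed edges v -> e' and e'' -> v of weight a, plus one
# directed edge e' -> e'' of weight a * b.  The minimum directed cut the
# gadget induces on a split (A, e \ A) equals a * min(|A|, |e \ A|, b).

def cb_gadget_edges(e, a, b, gadget_id=0):
    e1, e2 = ("aux1", gadget_id), ("aux2", gadget_id)   # auxiliary nodes e', e''
    edges = [(e1, e2, a * b)]                           # middle edge
    for v in e:
        edges.append((v, e1, a))                        # v -> e'
        edges.append((e2, v, a))                        # e'' -> v
    return edges

def cb_gadget_penalty(e, A, a, b):
    # closed-form cut penalty of the gadget, matching Eq. (7)
    return a * min(len(A), len(e) - len(A), b)

e = {1, 2, 3, 4, 5}
print(len(cb_gadget_edges(e, a=1.0, b=1.5)))        # 2|e| + 1 = 11 directed edges
print(cb_gadget_penalty(e, {1, 2}, a=1.0, b=1.5))   # 1.0 * min(2, 3, 1.5) = 1.5
```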
Sparse Combinations of CB-gadgets. Our work introduces a new framework for approximately modeling submodular cardinality-based (SCB) splitting functions using a small combination of CB-gadgets. Figure 2 illustrates our sparsification strategy. We first associate an SCB splitting function with a set of points {(i, w_i)}, where i represents the number of nodes on the "small side" of a cut hyperedge, and w_i is the penalty for such a split. We show that when many of these points are collinear, they can be modeled with a smaller number of CB-gadgets. As an example, the star expansion penalties (6) can be modeled with a single CB-gadget (Figures 2a and 2d), whereas modeling the quadratic penalty with previous techniques [75] requires many more (Figures 2b and 2e). Given this observation, we design new techniques for ε-approximating the set of points {(i, w_i)} with a piecewise linear curve using a small number of linear pieces. We then show how to translate the resulting piecewise linear curve back into a smaller combination of CB-gadgets that ε-approximates the original splitting function. Our piecewise linear approximation strategy allows us to find the optimal (i.e., minimum-sized) graph reduction in terms of CB-gadgets.

[Figure 2: The star expansion penalties (6) (a) can be modeled by a sparse gadget (d). The quadratic splitting function (5) penalties (b) can be modeled by a dense gadget (e). A piecewise linear approximation for the quadratic splitting penalties (c) corresponds to a sparse gadget (f).]

When ε = 0, our approach finds the best way to exactly model an SCB splitting function, and requires only half the number of gadgets needed by previous techniques [75]. More importantly, for larger ε, we prove the following sparse approximation result, which is used to prove Theorem 1.1.

Theorem 1.2. For ε ≥ 0, any submodular cardinality-based splitting function on a k-node hyperedge can be ε-modeled by combining O(min{ε^{-1} log k, k}) CB-gadgets.

We show that a nearly matching lower bound of Ω(ε^{-1/2} log k) CB-gadgets is required for modeling a square root splitting function. Despite these worst-case bounds, we prove that only O(ε^{-1/2} log log (1/ε)) CB-gadgets are needed to approximate the quadratic splitting function, independent of hyperedge size. This is particularly relevant for approximating the widely used clique expansion technique, as well as for modeling certain types of dense co-occurrence graphs. All of our sparse reduction techniques are combinatorial, deterministic, and very simple to use in practice.

When H is just a graph, augmented sparsifiers correspond to a generalization of standard cut sparsifiers that allows directed edges and auxiliary nodes. The auxiliary nodes in this case play a role analogous to Steiner nodes in finding minimum spanning trees. Just as adding Steiner nodes makes it possible to find a smaller weight spanning tree, it is natural to ask whether including an auxiliary node set might lead to better cut sparsifiers for a graph G. We show that the answer is yes for certain classes of dense co-occurrence graphs, which are graphs constructed by inducing a clique on a set of nodes that share a certain property or participate in a certain type of group interaction (equivalently, clique expansions of hypergraphs). Steiner nodes have in fact been previously used in constructing certain types of sparsifiers called vertex and flow sparsifiers [20].
However, these are concerned with preserving certain routing properties between distinguished terminal nodes in a graph, and are therefore distinct from our goal of obtaining ε-cut sparsifiers.

Sparsifying the complete graph. Our ability to sparsify the clique splitting function (5) directly implies a new approach for sparsifying a complete graph. Cut sparsifiers for the complete graph provide a simple case study for understanding the differences in sparsification guarantees that can be obtained when we allow auxiliary nodes and directed edges. Furthermore, better sparsifiers for the complete graph can be used to design useful sparsifiers for co-occurrence graphs. We have the following result.

Theorem 1.3. Let G = (V, E) be the complete graph on n = |V| nodes. There exists an ε-augmented sparsifier for G with O(n) nodes and O(n ε^{-1/2} log log (1/ε)) edges.

By comparison, the best standard cut and spectral sparsifiers for the complete graph have exactly n nodes and O(n/ε²) edges. This is tight for spectral sparsifiers [10], as well as for degree-regular cut sparsifiers with uniform edge weights [3]. Thus, by adding a small number of auxiliary nodes, our sparsifiers enable us to obtain a significantly better dependence on ε when cut-sparsifying a complete graph. Our sparsifier is easily constructed deterministically in O(n ε^{-1/2} log log (1/ε)) time. Standard undirected sparsifiers for the complete graph have received significant attention, as they correspond to expander graphs [3, 10, 56, 59]. We remark that the directed augmented cut sparsifiers we produce are very different in nature and should not be viewed as expanders. In particular, unlike for expander graphs, random walks on our complete graph sparsifiers will converge to a very nonuniform distribution. We are interested in augmented sparsifiers for the complete graph simply for their ability to model cut properties in a different way, and for the implications this has for sparsifying hypergraph clique expansions and co-occurrence graphs.

Sparsifying co-occurrence graphs. Co-occurrence relationships are inherent in the construction of many types of graphs. Formally, consider a set of n = |V| nodes that are organized into a set of co-occurrence interactions C ⊆ 2^V. Interaction c ∈ C is associated with a weight w_c > 0, and an edge between nodes i and j is created with weight w_ij = Σ_{c∈C : i,j∈c} w_c. When w_c = 1 for every c ∈ C, w_ij equals the number of interactions that i and j share. We use d_avg to denote the average number of co-occurrence interactions in which nodes in V participate. The cut value in the resulting graph G = (V, E) for a set S ⊆ V is given by the following co-occurrence cut function:

    cut_G(S) = Σ_{c ∈ C} w_c · |S ∩ c| · |c \ S|.

Graphs with this co-occurrence cut function arise frequently as clique expansions of a hypergraph [12, 31, 81, 83], or as projections of a bipartite graph [49, 60, 61, 65, 71, 78, 82]. Even when the underlying dataset is not first explicitly modeled as a hypergraph or bipartite graph, many approaches implicitly use this construction to generate a graph from data. When enough group interaction sizes are large, G becomes dense, even if |C| is small. We can significantly sparsify G by applying an efficient sparsifier to each clique induced by a co-occurrence relationship. Importantly, we can do this without ever explicitly forming G. By applying Theorem 1.3 as a black-box for clique sparsification, we obtain the following result.
Table 1: Runtimes of algorithms for Card-DSFM. Here n = |V|, R = |E|, µ = Σ_{e∈E} |e|, and µ₂ = Σ_{e∈E} |e|², and θ_max and θ_avg denote certain oracle runtimes. These satisfy θ_max = Ω(max_e |e|) and θ_avg = Ω((1/R) Σ_{e∈E} |e|). T_mf(N, M) is the time to solve a max-flow problem with N nodes and M edges.

    Method                                Discrete/Continuous    Runtime
    Kolmogorov SF [43]                    Discrete               Õ(µ₂)
    IBFS Strong [22, 24]                  Discrete               O(n² θ_max µ₂)
    IBFS Weak [22, 24]                    Discrete               Õ(n² θ_max + n Σ_e |e|⁴)
    AP [22, 52, 62]                       Continuous             Õ(n R θ_avg µ)
    RCDM [22, 23]                         Continuous             Õ(n² R θ_avg)
    ACDM [22, 23]                         Continuous             Õ(n R θ_avg)
    Axiotis et al. [6]                    Discrete               Õ(max_e |e|² · Σ_e |e|² θ_e + T_mf(n, n + µ₂))
    This paper (exact solutions)          Discrete               Õ(T_mf(µ, µ₂)) = Õ(µ₂ + µ^{3/2})
    This paper (approximate solutions)    Discrete               Õ(T_mf(n + R/ε, µ/ε)) = Õ(µ/ε + (n + R/ε)^{3/2})

Theorem 1.4. Let G = (V, E) be the co-occurrence graph for some C ⊆ 2^V and let n = |V|. For ε > 0, there exists an augmented sparsifier Ĝ with O(n + |C| · f(ε)) nodes and O(n · d_avg · f(ε)) edges, where f(ε) = ε^{-1/2} log log (1/ε). In particular, if d_avg is constant and for some δ > 0 we have Σ_{c∈C} |c|² = Ω(n^{1+δ}), then forming G explicitly takes Ω(n^{1+δ}) time, but an augmented sparsifier for G with O(n f(ε)) nodes and O(n f(ε)) edges can be constructed in O(n f(ε)) time.

Importantly, the average co-occurrence degree d_avg is not the same as the average node degree in G, which will typically be much larger. Theorem 1.4 highlights that in regimes where d_avg is a constant, our augmented sparsifiers will have fewer edges than the number needed by standard ε-cut sparsifiers. In Section 5, we consider simple graph models that satisfy these assumptions. We also consider tradeoffs between our augmented sparsifiers and standard sparsification techniques for co-occurrence graphs. Independent of the black-box sparsifier we use, implicitly sparsifying G in this way will often lead to significant runtime improvements over forming G explicitly.

Typically in hypergraph cut problems it is natural to assume that splitting functions are symmetric and satisfy w_e(∅) = w_e(e) = 0. However, we show that our sparse reduction techniques apply even when these assumptions do not hold. This allows us to design fast algorithms for approximately solving certain decomposable submodular function minimization (DSFM) problems. Formally, a function f : 2^V → R_+ is a decomposable submodular function if it can be written as

    f(S) = Σ_{e ∈ E} f_e(S ∩ e),

where each f_e is a submodular function defined on a set e ⊆ V. Following our previous notation and terminology, we say f_e is cardinality-based if f_e(S) = g_e(|S|) for some concave function g_e. This special case, which we refer to as Card-DSFM, has been one of the most widely studied and applied variants since the earliest work on DSFM [43, 72]. In terms of theory, previous research has addressed specialized runtimes and solution techniques [41, 43, 72]. In practice, cardinality-based decomposable submodular functions frequently arise as higher-order energy functions in computer vision [41] and as set cover functions [72]. Even previous research on algorithms for the more general DSFM problem tends to focus on cardinality-based examples in experimental results [22, 36, 52, 72]. Existing approaches for minimizing these functions focus largely on finding exact solutions. Using our sparse reduction techniques, we develop the first approximation algorithms for the problem. Let n = |V|, R = |E|, and µ = Σ_{e∈E} |e|. In Appendix B, we show that a result similar to Theorem 1.1 also holds for more general cardinality-based splitting functions. In Section 6, we combine that result with fast recent s-t cut solvers [73] to prove the following theorem.
Theorem 1.5. Let ε > 0. Any cardinality-based decomposable submodular function can be minimized to within a multiplicative (1 + ε) factor in Õ((1/ε) Σ_{e∈E} |e| + (n + R/ε)^{3/2}) time.

We compare this runtime against the best previous techniques for Card-DSFM. We summarize runtimes for competing approaches in Table 1. Our techniques enable us to highlight regimes of the problem where we can obtain significantly faster algorithms when it is sufficient to solve the problem approximately. For example, whenever n = Ω(R), our algorithms for finding approximate solutions provide a runtime advantage (often a significant one) over approaches for computing an exact solution. We provide a more extensive theoretical runtime comparison in Section 6. In Section 7, we show that implementations of our methods lead to significantly faster results for benchmark image segmentation tasks and localized hypergraph clustering problems.

A generalized hypergraph cut function is defined as the sum of its splitting functions. Therefore, if we can design a technique for approximately modeling a single hyperedge with a sparse graph, this in turn provides a method for constructing an augmented sparsifier for the entire hypergraph. We now formalize the problem of approximating a submodular cardinality-based (SCB) splitting function using a combination of cardinality-based (CB) gadgets. We abstract this as the task of approximating a certain class of functions with integer inputs (equivalent to SCB splitting functions) using a small number of simpler functions (equivalent to cut properties of the gadgets).

Definition 2.1. For r ∈ N, a function w : {0, 1, . . . , r} → R_+ is an r-SCB integer function if it satisfies w(0) = 0, w(i+1) − w(i) ≤ w(i) − w(i−1) for all i ∈ [r−1], and w(r) ≥ w(r−1).

We denote the set of r-SCB integer functions by S_r. The value w(i) represents the splitting penalty for placing i nodes on the small side of a cut hyperedge. In previous work we showed that the inequalities given in Definition 2.1 are necessary and sufficient conditions for a cardinality-based splitting function to be submodular [75]. The r-SCB integer function for a CB-gadget with edge parameters (a, b) (see (7)) is

    w_{a,b}(i) = a · min{i, b}.    (13)

Combining J CB-gadgets produces a combined r-SCB integer function of special importance.

Definition 2.2. An r-CCB (Combined Cardinality-Based gadget) function of order J is an r-SCB integer function ŵ with the form

    ŵ(i) = Σ_{j=1}^{J} a_j · min{i, b_j},    (14)

where the J-dimensional vectors a = (a_j) and b = (b_j) parameterizing ŵ satisfy:

    a_j > 0 for all j ∈ [J],    (15)
    0 < b_1 < b_2 < · · · < b_J,    (16)
    b_J ≤ r.    (17)

We denote the set of r-CCB functions of order J by C_r^J. The conditions on the vectors a and b come from natural observations about combining CB-gadgets. Condition (15) ensures that we do not consider CB-gadgets where all edge weights are zero. The ordering in condition (16) is for convenience; the fact that the b_j values are all distinct implies that we cannot collapse two distinct CB-gadgets into a single CB-gadget with new weights. For condition (17), observe that for any b_J ≥ r, min{i, b_J} = i for all i ∈ [r], so nothing is lost by requiring b_J ≤ r. For a helpful visual, note that the r-SCB function in (13) represents splitting penalties for the CB-gadget in Figure 1c. An r-CCB function corresponds to a combination of CB-gadgets, as in Figures 2c and 2f. In previous work we showed that any combination of CB-gadgets produces a submodular and cardinality-based splitting function, which is equivalent to stating that C_r^J ⊆ S_r for all J ∈ N [75]. Furthermore, C_r^r = S_r, since any r-SCB splitting function can be modeled by a combination of r CB-gadgets. Our goal here is to determine how to approximate a function w ∈ S_r with some function ŵ ∈ C_r^J where J ≪ r.
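The following Python sketch (ours) makes the objects S_r and C_r^J concrete: it builds an r-SCB integer function from a concave g, evaluates an r-CCB function from gadget parameters (a, b), and measures the approximation factor of an upper-bounding ŵ. The particular weights are illustrative only, not produced by the algorithms developed below.

```python
# Sketch (ours): an r-SCB integer function from a concave g, an r-CCB
# function from CB-gadget parameters (a, b), and the approximation check
# w(i) <= w_hat(i) <= (1 + eps) * w(i) used throughout this paper.
import math

def scb(g, r):
    return [0.0] + [g(i) for i in range(1, r + 1)]        # w(0) = 0

def ccb(a, b, r):
    # w_hat(i) = sum_j a_j * min(i, b_j), i.e., a combination of CB-gadgets
    return [sum(aj * min(i, bj) for aj, bj in zip(a, b)) for i in range(r + 1)]

def approx_factor(w, w_hat):
    # assumes w_hat upper-bounds w at every integer 1..r
    return max(w_hat[i] / w[i] for i in range(1, len(w)))

r = 10
w = scb(math.sqrt, r)                  # square-root splitting penalties
w_hat = ccb([1.0, 0.25], [1, 10], r)   # two CB-gadgets (hand-picked, illustrative)
print(approx_factor(w, w_hat))         # 1.25 for this choice of weights
```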
Approximating a function w ∈ S_r by a low-order ŵ in this way corresponds to modeling an SCB splitting function using a small combination of CB-gadgets.

Definition 2.3. For a fixed w ∈ S_r and an approximation tolerance parameter ε ≥ 0, the Sparse Gadget Approximation Problem (Spa-GAP) is the following optimization problem:

    minimize J  subject to  ŵ ∈ C_r^J  and  w(i) ≤ ŵ(i) ≤ (1 + ε) · w(i) for all i ∈ [r].    (18)

Upper Bounding Approximations. Problem (18) specifically optimizes over functions ŵ that upper bound w. This restriction simplifies several aspects of our analysis without any practical consequence. For example, we could instead fix some δ ≥ 1 and optimize over functions w̃ satisfying (1/δ) w ≤ w̃ ≤ δ w. However, the function ŵ = δ w̃ then satisfies w ≤ ŵ ≤ (1 + ε) w with ε = δ² − 1. Thus, the two problems are equivalent for the correct choice of δ and ε.

Motivation for Optimizing over CB-gadgets. A natural question to ask is whether it would be better to search for a sparsest approximating gadget over a broader class of gadgets. There are several key reasons why we restrict to combinations of CB-gadgets. First of all, we already know these can model any SCB splitting function, and thus they provide a very simple building block with broad modeling capabilities. Furthermore, it is clear how to define an optimally sparse combination of CB-gadgets: since all CB-gadgets for a k-node hyperedge have the same number of auxiliary nodes and directed edges, an optimally sparse reduction is one with a minimum number of CB-gadgets. If we instead wish to optimize over all possible gadgets, it is likely that the best reduction technique will depend on the splitting function that we wish to approximate. Furthermore, the optimality of a gadget may not even be well-defined, since one must take into account both the number of auxiliary nodes as well as the number of edges that are introduced, and the tradeoff between the two is not always clear. Finally, as we shall see in the next section, by restricting to CB-gadgets, we are able to draw a useful connection between sparse gadgets and approximating piecewise linear curves with a smaller number of linear pieces.

We begin by defining the class of piecewise linear functions in which we are interested.

Definition 3.1. For r ∈ N, F_r is the class of functions f : [0, ∞) → R_+ such that: (1) f is piecewise linear and satisfies f(0) = 0; (2) f is constant for all x ≥ r; (3) f is concave (and hence, continuous).

It will be key to keep track of the number of linear pieces that make up a given function f ∈ F_r. Let L be the set of linear functions with nonnegative slopes and intercept terms:

    L = { g : g(x) = q · x + z, where q ≥ 0 and z ≥ 0 }.    (19)

Every function f ∈ F_r can be characterized as the lower envelope of a finite set L ⊂ L of these linear functions:

    f(x) = min_{g ∈ L} g(x).    (20)

We use |L| to denote the number of linear pieces of f. In order for (20) to properly characterize a function in F_r, f must be constant for all x ≥ r (property 2 in Definition 3.1), and thus L must contain exactly one line of slope zero. The continuous extension f̂ of an r-CCB function w parameterized by (a, b) is defined as

    f̂(x) = Σ_{j=1}^{J} a_j · min{x, b_j}.    (21)

We prove that continuously extending any r-CCB function always produces a function in F_r. Conversely, every f ∈ F_r is the continuous extension of some r-CCB function. Appendix A provides proofs of these results.

Lemma 3.1. Let f̂ be the continuous extension of w, shown in (21). This function is in the class F_r, and has exactly J positive-sloped linear pieces and one linear piece of slope zero.

Lemma 3.2. Let f ∈ F_r have J positive-sloped linear pieces. Then there exist vectors (a, b) such that, if w is the r-CCB function parameterized by (a, b), then f is the continuous extension of w.

Let w ∈ S_r be an arbitrary SCB integer function.
Lemma 3.2 implies that if we can find a piecewise linear function f that approximates w and has few linear pieces, we can extract from it a CCB functionŵ with a small order J that approximates w. Equivalently, we can find a sparse gadget Figure 3 : We restrict our attention to lines in L that coincide with w at at least one integer value. Thus, every function we consider is incident to two consecutive values of w (e.g., the solid line, g (1) ), or, it touches w at exactly one point (dashed line, g). that approximates an SCB splitting function of interest. Our updated goal is therefore to solve the following piecewise linear approximation problem, for a given w ∈ S r and ε ≥ 0: The last constraint ensures that each linear piece g ∈ L we consider crosses through at least one point (j, w(j)). We can add this constraint without loss of generality; if any linear piece g is strictly greater than w at all integers, we could obtain an improved approximation by scaling g until it is tangent to w at some point. This constraint, together with the requirement f ∈ F r , implies that the constant function g (r) (x) = w(r) is contained in every set of linear functions L that is feasible for (22) . Since all feasible solutions contain this constant linear piece, our focus is on determining the optimal set of positive-sloped linear pieces needed to approximate w. Optimal linear covers. Given a fixed ε ≥ 0 and i ∈ {0} ∪ [r − 1], we will say a set L ⊂ L is a linear cover for a function w ∈ S r over the range R = {i, i + 1, . . . , r}, if each g ∈ L upper bounds w at all points, and if for each j ∈ R there exists g ∈ L such that g(j) ≤ (1 + ε)w(j). The set L is an optimal linear cover if it contains the minimum number of positive-sloped linear pieces needed to cover R. Thus, an equivalent way of expressing (22) is that we wish to find an optimal linear cover for w over the interval {0} ∪ [r]. In practice there may be many different function f ∈ F r which solve (22) , but for our purposes it suffices to find one. We solve problem (22) by iteratively growing a set of linear functions L ⊂ L one function at a time, until all of w is covered. Let f be the piecewise linear function we construct from linear pieces in L. In order for f to upper bound w, every function g ∈ L in problem (22) must upper bound w at every i ∈ {0} ∪ [r]. One way to obtain such a linear function is to connect two consecutive points of w. For i ∈ {0} ∪ [r − 1], we denote the line joining points (i, w(i)) and (i + 1, w(i + 1)) by where the slope of the line is M i = w(i + 1) − w(i). In order for a line to upper bound w but only pass through a single point (i, w(i)) for some i ∈ [r − 1], it must have the form where the slope m satisfies M i < m < M i−1 . The existence of such a line g is only possible when the points (i − 1, w(i − 1)), (i, w(i)), and (i + 1, w(i + 1)) are not collinear. To understand the strict bounds on m, note that if g passes through (i, w(i)) and has slope exactly M i−1 , then g is in fact the line g (i−1) and also passes through (i − 1, w(i − 1)). If g has slope greater than M i−1 , then g(i − 1) < w(i − 1) and does not upper bound w everywhere. We can similarly argue that the slope of g must be strictly greater than M i so that it does not touch or cross below the point (i + 1, w(i + 1)). We illustrate both types of functions (23) and (24) in Figure 3 . The following simple observation will later help in comparing approximation properties of different functions in L. 
If m g and m h are the slopes of g and h respectively, and m g ≥ m h ≥ 0, then In other words, if g and h are both tangent to w at the same point j, but g has a larger slope than h, then g provides a better approximation for values smaller than j, while h is the better approximation for values larger than j. The first linear piece. Every set L solving (22) must include a linear piece that goes through the origin, so that f (0) = 0. We specifically choose g (0) (x) = (w(1) − w(0))x + w(0) = w(1)x to be the first linear piece in the set L we construct. Given this first linear piece, we can then compute the largest integer i ∈ [r] for which g (0) provides a (1 + ε)-approximation: The integer = p + 1 therefore is the smallest integer for which we do not have a (1 + ε)approximation. If ≤ r, our task is then to find the smallest number of additional linear pieces in order to cover { , . . . , r} with (1 + ε)-approximations. By Observation 3.1, any other g ∈ L with g(0) = 0 and g(1) > w(1) will be a worse approximation to w at all integer values: . Therefore, as long as we can find a minimum set of additional linear pieces which provides a (1 + ε)-approximation for all { , . . . , r}, our set of functions L will optimally solve objective (22) . Iteratively finding the next linear piece. Consider now a generic setting in which we are given a left integer endpoint and we wish to find linear pieces to approximate the function w from to r. We first check whether the constant function g (r) (x) = w(r) provides the desired approximation: If so, we augment L to include g (r) and we are done, since this implies that g (r) also provides at least a (1 + ε)-approximation at every i ∈ { , + 1, . . . , r}. If (25) is not true, we must add another positive-sloped linear function to L in order to get the desired approximation for all i ∈ [r]. We adopt a greedy approach that chooses the next line to be the optimizer of the following objective In other words, solving problem (26) means finding a function that provides at least a (1 + ε)approximation from to as far towards r as possible in order to cover the widest possible contiguous interval with the same approximation guarantee. (There is always a feasible point by adding a line g tangent to w( ).) The following Lemma will help us prove that this greedy scheme produces an optimal cover for w. Lemma 3.3. Let p * the solution to (26) and g * be the function that achieves it. IfL ⊂ L is an optimal cover for w over the integer range {p * + 1, p * + 2, . . . r}, then {g * } ∪L is an optimal cover for { , + 1, . . . r}. Proof. LetL be an arbitrary optimal linear cover for w over the range { , + 1, . . . , r}. This means that |L ∪ {g * }| ≥ |L|. We knowL must contain a function g such that g( ) ≤ (1 + ε)w( ). Let p g be the largest integer satisfying g(p g ) ≤ (1 + ε)w(p g ). By the optimality of p * and g * , we know p * ≥ p g . Therefore, the set of functionsL − {g} must be a cover for the set {p g + 1, p g + 2, . . . r} ⊇ {p * + 1, p * + 2, . . . r}. SinceL is an optimal cover for a subset of the integers covered byL − {g}, Therefore, |L ∪ {g * }| = |L|, so the result follows. We illustrate a simple procedure for solving (26) in Figure 4 . The function g solving (26) must either join two consecutive points of w (the form given in (23)), or coincide at exactly one point of w (form given in (24)). 
We first identify the integer j * such that In other words, the linear piece connecting (j * , w(j * )) and (j * + 1, w(j * + 1)) provides the needed approximation at the left endpoint , but g (i) for every i > j * does not. Therefore, the solution to (26) has a slope m ∈ [M j * , M j * +1 ), and passes through the point (j * , w(j * )). By Observation 3.1, the line passing through this point with the smallest slope is guaranteed to provide the best approximation for all integers p ≥ j * . To minimize the slope of the line while still preserving the needed approximation at w( ), we select the line passing through the points ( , (1 + ε)w( )) and (j * , w(j * )). This is given by Figure 4 : Given a left endpoint for which we do not yet have a (1 + ε)-approximate piece, we find the next linear piece by choosing a function g * that provides the desired approximation at , while also providing a good approximation for as large of an integer p > as possible. After adding this function g * to L, we find the largest integer p ≤ r such that g * (p) ≤ (1 + ε)w(p). If p < r, then we still need to find more linear pieces to approximate w, so we continue with another iteration. If p = r exactly, then we do not need any more positive-sloped linear pieces to approximate w. However, we still add the constant function g (r) to L before terminating. This guarantees that the function f (x) = min g∈L (x) we return is in fact in F r . Furthermore, adding the constant function serves to improve the approximation, without affecting the order of the CCB function we will obtain from f by applying Lemma 3.2. Pseudocode for our procedure for constructing a set of function L is given in Algorithm 1, which relies on Algorithm 2 for solving (26) . We summarize with a theorem about the optimality of our method for solving (22) . Proof. The optimality of the algorithm follows by inductively applying Lemma 3.3 at each iteration of the algorithm. For the runtime guarantee, note first of all that we can compute and store all slopes and intercepts for linear pieces g (i) (as given in (23)) in O(r) time and space. As the algorithm progresses, we visit each integer i ∈ [r] once, either to perform a comparison of the form g (i) ( ) ≤ (1 + ε)w( ) for some left endpoint , or to check whether g * (i) ≤ (1 + ε)w(i) for some linear piece g * we added to our linear cover L. Each such g * can be computed in constant time, and as a loose bound we know we compute at most O(r) such linear pieces for any ε. By combining Algorithm 2 and Lemma 3.2, we are able to efficiently solve Spa-GAP. Theorem 3.5. Let f be the solution to (22) , andŵ be the CCB function obtained from Lemma 3.2 based on f . Thenŵ optimally solves the sparse gadget approximation problem (18) . Proof. Since f andŵ coincide at integer values, and f approximates w at integer values, we know w(i) ≤ŵ(i) ≤ (1 + ε)w(i) for i ∈ [r]. Thus,ŵ is feasible for objective (18) . If κ * is the number of positive-sloped linear pieces of f , then the order ofŵ is κ * by Lemma 3.2, and this must be optimal for (18) . If it were not optimal, this would imply that there exists some upper bounding CCB function w of order κ < κ * that approximates w to within 1 + ε. But by Lemma 3.1, this would imply that the continuous extension of w is some f ∈ F r with exactly κ positive-sloped linear pieces that is feasible for objective (22) , contradicting the optimality of f . 
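The greedy covering strategy of this section, together with the translation of Lemma 3.2, is easy to realize in code. The following Python sketch is our own simplified rendering (not the authors' Algorithms 1-2, and with our own variable names): greedy_cover builds a small set of upper-bounding lines whose lower envelope (1+ε)-covers concave, nondecreasing penalties w(0), ..., w(r) with w(0) = 0, and lines_to_gadgets converts that envelope into CB-gadget parameters (a_j, b_j) by taking slope differences and breakpoints, assuming every returned line actually appears on the envelope.

```python
# A sketch (ours, simplified) of the greedy (1+eps)-cover and the
# cover-to-gadget translation of Lemma 3.2.

def greedy_cover(w, eps):
    """Return lines (slope, intercept) whose lower envelope (1+eps)-covers w."""
    r = len(w) - 1

    def chord(j):                       # line through (j, w[j]) and (j+1, w[j+1])
        m = w[j + 1] - w[j]
        return (m, w[j] - m * j)

    def val(g, x):
        return g[0] * x + g[1]

    def last_covered(g, start):         # last integer >= start still (1+eps)-covered by g
        i = start
        while i + 1 <= r and val(g, i + 1) <= (1 + eps) * w[i + 1]:
            i += 1
        return i

    lines = [chord(0)]                  # first piece g0(x) = w(1) * x passes through the origin
    ell = last_covered(chord(0), 0) + 1
    while ell <= r:
        if w[r] <= (1 + eps) * w[ell]:  # the constant piece w(r) covers everything remaining
            break
        j = ell                         # largest j with chord(j) within (1+eps) at ell
        while j + 1 <= r - 1 and val(chord(j + 1), ell) <= (1 + eps) * w[ell]:
            j += 1
        if j == ell:
            g = chord(ell)              # degenerate case: keep the local chord
        else:                           # line through (ell, (1+eps) w(ell)) and (j, w(j))
            m = (w[j] - (1 + eps) * w[ell]) / (j - ell)
            g = (m, w[j] - m * j)
        lines.append(g)
        ell = last_covered(g, ell) + 1
    lines.append((0.0, float(w[r])))    # final zero-slope piece
    return lines

def lines_to_gadgets(lines):
    """Envelope -> CB-gadgets: a_j = slope difference, b_j = breakpoint.
    Assumes every line appears on the lower envelope (cf. Lemma 3.2)."""
    lines = sorted(lines, key=lambda g: -g[0])
    gadgets = []
    for (m1, c1), (m2, c2) in zip(lines, lines[1:]):
        gadgets.append((m1 - m2, (c2 - c1) / (m1 - m2)))
    return gadgets

w = [0.0] + [i ** 0.5 for i in range(1, 101)]        # square-root penalties, r = 100
print(lines_to_gadgets(greedy_cover(w, eps=0.1)))    # a handful of (a_j, b_j) pairs
```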
In our last section we showed an efficient strategy for finding the minimum number of linear pieces needed to approximate an SCB integer function. We now consider bounds on the number of needed linear pieces in different cases, and highlight implications for sparsifying hyperedges with SCB splitting functions. In the worst case, we show that we need O(log k/ε) gadgets, where k is the size of the hyperedge. Moreover, this is nearly tight for the square root splitting function. Finally, we show that we only need O(ε −1/2 log log 1 ε ) gadgets to approximate the clique splitting function. This result is useful for sparsifying co-occurrence graphs and clique expansions of hypergraphs. We begin by showing that a logarithmic number of CB-gadgets is sufficient to approximate any SCB splitting function. Theorem 4.1. Let ε ≥ 0 and w e be an SCB splitting function on a k-node hyperedge. There exists a set of O(log 1+ε k) CB-gadgets, which can be constructed in O(k log 1+ε k) time, whose splitting functionŵ e satisfies w e (A) ≤ŵ e (A) Proof. Let r = k/2 , and let w ∈ S r be the SCB integer function corresponding to w e , i.e., w(i) = w e (A) for A ⊆ e such that |A| ∈ {i, k − i}. If we join all points of the form (i, w(i)) for i ∈ [r] by a line, this results in a piecewise linear function f ∈ F r that is concave and increasing on the interval [0, r]. We first show that there exists a set of O(log 1+ε r) linear pieces that approximates f on the entire interval [1, r] to within a factor (1+ε). Our argument follows similar previous results for approximating a concave function with a logarithmic number of linear pieces [28, 58] . For any value y ∈ [1, r], not necessarily an integer, f (y) lies on a linear piece of f which we will denote by g (y) (x) = M y · x + B y , where M y ≥ 0 is the slope and B y ≥ 0 is the intercept. When y = i is an integer, it may be the breakpoint between two distinct linear pieces, in which case we use the rightmost line so that g (y) = g (i) as in (23) Equivalently, the line g (y) provides a (1 + ε)-approximation for every z ∈ [y, (1 + ε)y]. Thus, it takes J linear pieces to cover the set of intervals [ for a positive integer J, and overall at most 1 + log 1+ε r linear pieces to cover all of [0, r]. Since Algorithm 1 finds the smallest set of linear pieces to (1 + ε)-cover the splitting penalties, this smallest set must also have at most O(log 1+ε r) linear pieces. Given this piecewise linear approximation, we can use Lemma 3.2 to extract a CCB functionŵ of order . Thisŵ in turn corresponds to a set of J CB-gadgets that (1 + ε)-approximates the splitting function w e . Computing edge weights for the CB-gadgets using Algorithm 1 and Lemma 3.2 takes only O(r) time, so the total runtime for constructing the combined gadgets is equal to the number of individual edges that must be placed, which is O(k log 1+ε k). Theorem 1.1 on augmented sparsifiers follows as a corollary of Theorem 4.1. Given a hypergraph H = (V, E) where each hyperedge has an SCB splitting function, we can use Theorem 4.1 to expand each e ∈ E into a gadget that has O(log 1+ε |e|) auxiliary nodes and O(|e| log 1+ε |e|) edges. Since log 1+ε n behaves as 1 ε log n as ε → 0, Theorem 1.1 follows. In Appendix B, we show that using a slightly different reduction, we can prove that Theorem 4.1 holds even when we do not require splitting functions to be symmetric or satisfy w e (∅) = w e (e) = 0. 
In Section 6 we use this fact to develop approximation algorithms for cardinality-based decomposable submodular function minimization. Next we show that our upper bound is nearly tight for the square root r-SCB integer function, For this result, we rely on a result previously shown by Magnanti and Stratila [58] on the number of linear pieces needed to approximate the square root function over a continuous interval. Let ψ be a piecewise linear function whose linear pieces are all tangent lines to φ, satisfying ψ( There exists a piecewise linear function ψ * of this form with exactly log γ(ε) u l linear pieces. 1 As ε → 0, this values behaves as 1 √ 32ε log u l . Lemma 4.2 is concerned with approximating the square root function for all values on a continuous interval. Therefore, it does not immediately imply any bounds on approximating a discrete set of splitting penalties. In fact, we know that when lower bounding the number of linear pieces needed to approximate any w ∈ S r , there is no lower bound of the form q(ε)f (r) that holds for all ε > 0, if q is a function such that q(ε) → ∞ as ε → 0. This is simply because we can approximate w by piecewise linear interpolation, leading to an upper bound of O(r) linear pieces even when ε = 0. Therefore, the best we can expect is a lower bound that holds for ε values that may still go to zero as r → ∞, but are bounded in such a way that we do not contradict the O(r) upper bound that holds for all SCB integer functions. We prove such a result for the square root splitting function, using Lemma 4.2 as a black box. When ε falls below the bound we assume in the following theorem statement, forming O(r) linear pieces will be nearly optimal. Theorem 4.3. Let ε > 0 and w(i) = √ i be the square root r-SCB integer function. If ε ≥ r −δ for some constant δ ∈ (0, 2), then any piecewise linear function providing a (1 + ε)-approximation for w contains Ω(log γ(ε) r) linear pieces, which behaves as Ω(ε −1/2 log r) as ε → 0. Proof. Let L * be the optimal set of linear pieces returned by running Algorithm 1. In order to show |L * | = Ω(log γ(ε) r), we will construct a new set of linear pieces L that has asymptotically the same number of linear pieces as L * , but also provides a (1 + ε)-approximation for all x in an interval [r β , r] for some constant β < 1. Invoking Lemma 4.2 will then guarantee the final result. Recall that L * includes only two types of linear pieces: either linear pieces g satisfying g(j) = √ j for exactly one integer j (see (24) ), or linear pieces formed by joining two points of w (see (23)). For the square root splitting function, the latter type of linear piece is of the form for some positive integer t less than r. This is the linear interpolation of the points (t, √ t) and (t + 1, √ t + 1). Both types of linear pieces bound φ(x) = √ x above at integer points, but they may cross below φ at non-integer values of x. To apply Lemma 4.2, we would like to obtain a set of linear pieces that are all tangent lines to φ. We accomplish this by replacing each linear piece in L * with two or three linear pieces that are tangent to φ at some point. For a positive integer j, let g j denote the line tangent to φ(x) = √ x at x = j, which is given by We form a new set of linear pieces L made up of lines tangent to φ using the following replacements: • If L * contains a linear piece g that satisfies g(j) = √ j for exactly one integer j, add lines g j−1 , g j , and g j+1 to L. • If for an integer t, L * contains the line g (t) as given by Eq. 
(29) , add lines g t and g t+1 to L. By Observation 3.1, this replacement can only improve the approximation guarantee at integer points. Therefore, L provides a (1 + ε)-approximation at integer values, is made up strictly of lines that are tangent to φ, and contains at most three times the number of lines in L * . Due to the concavity of φ, if a single line g ∈ L provides a (1 + ε)-approximation at consecutive integers i and i+1, then g provides the same approximation guarantee for all x ∈ [i, i+1]. However, if two integers i and i + 1 are not both covered by the same line in L, then L does not necessarily provide a (1 + ε)-approximation for every x ∈ [i, i + 1]. There can be at most |L| intervals of this form, since each interval defines an "intersection" at which one line g ∈ L ceases to be a (1 + ε)approximation, and another line g ∈ L "takes over" as the line providing the approximation. By Lemma 4.2, we can cover an entire interval [i, i+1] for any integer i using a set of log γ(ε) 1+ can be covered by a single linear piece if i ≥ r δ/2 . Therefore, for each interval [i, i + 1], with i ≥ r δ/2 , that is not already covered by a single linear piece in L, we add one more linear piece to L to cover this interval. This at most doubles the size of L. The resulting set L will have at most 6 times as many linear pieces as L * , and is guaranteed to provide a (1 + ε)-approximation for all integers, as well as the entire continuous interval [r δ/2 , r]. Since δ is a fixed constant strictly less than 2, applying Lemma 4.2 shows that L has at least log γ(ε) r r δ/2 = Ω(log γ(ε) r 1−δ/2 ) = Ω(log γ(ε) r) linear pieces. Therefore, |L * | = Ω(log γ(ε) r) as well. When approximating the clique expansion splitting function, Algorithm 1 will in fact find a piecewise linear curve with at most O(ε −1/2 log log 1 ε ) linear pieces. We prove this by highlighting a different approach for constructing a piecewise linear curve with this many linear pieces, which upper bounds the number of linear pieces in the optimal curve found by Algorithm 1. Clique splitting penalties for a k-node hyperedge correspond to nonnegative integer values of the continuous function ζ(x) = x · (k − x). As we did in Section 3.3, we want to build a set of linear pieces L that provides and upper bounding (1 + ε)-cover of ζ at integer values in [0, r], where r = k/2 . We start by adding the line g (0) (x) = (w(1) − w(0))x + w(0) = (k − 1) · x to L, which perfectly covers the first two splitting penalties w(0) = 0 and w(1) = k − 1. In the remainder of our new procedure we will find a set of linear pieces to (1 + ε)-cover ζ at every value of x ∈ [1, k/2], even non-integer x. We apply a greedy procedure similar to Algorithm 1. At each iteration we consider a leftmost endpoint z i which is the largest value in [1, k/2] for which we already have a (1 + ε)-approximation. In the first iteration, we have z 1 = 1. We then would like to find a new linear piece that provides a (1 + ε)-approximation for all values from z i to some z i+1 , where the value of z i+1 is maximized. We restrict to linear pieces that are tangent to ζ. The line tangent to ζ at t ∈ [1, k/2] is given by We find z i+1 in two steps: 1. Step 1: Find the maximum value t such that g t (z i ) = (1 + ε)ζ(z i ). Step 2: Given t, find the maximum z i+1 such that g t (z i+1 ) = (1 + ε)ζ(z i+1 ). After completing these two steps, we add the linear piece g t to L, knowing that it covers all values in [z i , z i+1 ] with a (1 + ε)-approximation. 
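The two steps above reduce to solving two quadratic equations in t and z_{i+1}. The following Python sketch (ours) carries out the resulting iteration; the closed-form roots below are our own reconstruction of the solutions to those quadratics, obtained from the fact that the tangent to ζ(x) = x(k − x) at t is g_t(x) = (k − 2t)x + t², and are not copied from the paper.

```python
# A sketch (ours) of the greedy pass for covering the clique penalties
# zeta(x) = x * (k - x) with tangent lines, following Steps 1-2 above.
# Assumes eps > 0.
import math

def clique_cover(k, eps):
    lines = [(k - 1.0, 0.0)]                      # g0(x) = (k-1) x covers w(0), w(1)
    z = 1.0
    while z < k / 2.0:
        # Step 1: largest t whose tangent passes through (z, (1+eps) zeta(z));
        # cap at k/2 so every stored line has nonnegative slope.
        t = min(z + math.sqrt(eps * z * (k - z)), k / 2.0)
        lines.append((k - 2.0 * t, t * t))        # tangent g_t(x) = (k - 2t) x + t^2
        # Step 2: largest z' still (1+eps)-covered by g_t (larger quadratic root)
        a, b, c = (1.0 + eps), -(eps * k + 2.0 * t), t * t
        z = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return lines

print(len(clique_cover(k=10**6, eps=0.01)))       # grows like eps^{-1/2} loglog(1/eps)
```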
At this point, we will have a cover for all values in [0, z i+1 ], and we begin a new iteration with z i+1 being the largest value covered. We continue until we have covered all values up until z i+1 ≥ k/2. If t > k/2 in Step 1 of the last iteration, we adjust the last linear piece to instead be the line tangent to ζ at x = k/2, so that we only include lines that have a nonnegative slope. Lemma 4.4. For any z i ∈ [1, k/2], the values of t and z i+1 given in steps 1 and 2 are given by Proof. The proof simply requires solving two different quadratic equations. For Step 1: Taking the larger solution to maximize t: For Step 2: We again take the larger solution to this quadratic equation since we want to maximize z i+1 : Algorithm 3 summarizes the new procedure for covering the clique splitting function. Since so after one step we have covered the entire interval [1, k/2]. We can therefore focus on ε < 1. Proof. We get a loose bound for the value of t in Lemma 4.4 by noting that (k − z i ) ≥ k/2 ≥ z i : Algorithm 3 Find a (1 + ε)-cover L for the clique splitting function. Input: Hyperedge size k, ε ≥ 0 Output: (1 + ε) cover for clique splitting function. Since we assumed ε < 1, we know that Therefore, from (33) we see that > z i + kε 2(1 + ε) From this we see that at each iteration, we cover an additional interval of length z i+1 − z i > kε 1+ε , and therefore we know it will take at most O(1/ε) iterations to cover all of [1, k/2] . This upper bound is loose, however. The value of z i+1 − z i in fact increases significantly with each iteration, allowing the algorithm to cover larger and larger intervals as it progresses. Since z 1 = 1 and z i+1 − z i ≥ kε 1+ε , we see that z j ≥ kε for all j ≥ 3. For the remainder of the proof, we focus on bounding the number of iterations it takes to cover the interval [kε, k/2]. We separate the progress made by Algorithm 3 into different rounds. Round j refers to the set of iterations that the algorithm spends to cover the interval For example, Round 1 starts with the iteration i such that z i ≥ kε, and terminates when the algorithm reaches an iteration i where z i ≥ kε 1/2 . A key observation is that it takes less than 4/ √ ε iterations for the algorithm to finish Round j for any value of j. To see why, observe that from the bound in (36) we have For each iteration i in Round j, we know that where C = √ 2/(2(1 + ε)) is a constant larger than 1/4. Since each iteration of Round j covers an interval of length at least C ·k ·ε , and the right endpoint for Round j is kε ( 1 2 ) j , the maximum number of iterations needed to complete Round j is Therefore, after p rounds, the algorithm will have performed O(p · ε −1/2 ) iterations, to cover the interval [1, kε ( 1 2 ) p ]. Since we set out to cover the interval [1, k/2], this will be accomplished as soon as p satisfies ε ( 1 2 ) p ≥ 1/2, which holds as long as p ≥ log 2 log 2 1 ε : This means that the number of iteration of Algorithm 3, and therefore the number of linear pieces in L, is bounded above by O(ε −1/2 log log 1 ε ). We obtain a proof of Theorem 1.3 on sparsifying the complete graph as a corollary. Proof of Theorem 1.3. A complete graph on n nodes can be viewed as a hypergraph with a single n-node hyperedge with a clique expansion splitting function. Theorem 1.3 says that the clique expansion integer function w(i) = i · (n − i) can be covered with O(ε −1/2 log log ε −1 ) linear pieces, which is equivalent to saying the clique expansion splitting function can be modeled using this many CB-gadgets. 
Each CB-gadget has two auxiliary nodes and (2n + 1) directed edges. This results in an augmented sparsifier for the complete graph with O(nε −1/2 log log ε −1 ) edges. This is only meaningful if ε is small enough so that O(ε −1/2 log log ε −1 ) is asymptotically less than n, so our sparsifier has O(n + ε −1/2 log log ε −1 ) = O(n) nodes. Recall from the introduction that a co-occurrence graph is formally defined by a set of nodes V and a set of subsets C ⊆ 2 V . In practice, each c ∈ C could represent some type of group interaction involving nodes in c or a set of nodes sharing the same attribute. We define the co-occurrence graph G = (V, E) on C to be the graph where nodes i and j share an edge with weight w ij = c∈C w c , where w c ≥ 0 is a weight associated with co-occurrence set c ∈ C. The case when w c = 1 is standard and is an example of a common practice of "one-mode projections" of bipartite graphs or affiliation networks [13, 45, 49, 60, 61, 65, 82 ] -a graph is formed on the nodes from one side of a bipartite graph by connecting two nodes whenever they share a common neighbor on the other side, where edges are weighted based on the number of shared neighbors. A co-occurrence graph G has the following co-occurrence cut function: In this sense, the co-occurrence graph is naturally interpreted as a weighted clique expansion of a hypergraph H = (V, C), which itself is a special case of reducing a submodular, cardinality-based hypergraph to a graph. However, this type of graph construction is by no means restricted to literature on hypergraph clustering. In many applications, the first step in a larger experimental pipeline is to construct a graph of this type from a large dataset. The resulting graph is often quite dense, as numerous domains involve large hyperedges [64, 76] . This makes it expensive to form, store, and compute over co-occurrence graphs in practice. Solving cut problems on these dense co-occurrence graphs arises naturally in many settings. For example, any hypergraph clustering application that relies on a clique expansion involves a graph with a co-occurrence cut function [1, 31-33, 51, 66, 74, 76, 77, 81, 83] . Clustering social networks is another use case, as online platforms have many ways to create groups of users (e.g., events, special interest groups, businesses, organizations, etc.), that can be large in practice. Furthermore, cuts in co-occurrence graphs of students on a university campus (based on, e.g., common classes, living arrangements, or physical proximity) are relevant to preventing the spread of infectious diseases such as COVID-19. In these cases, it would be more efficient to sparsify the graph without ever forming it explicitly, by sparsifying large cliques induced by co-occurrence relationships. Although this strategy seems intuitive, it is often ignored in practice. We therefore present several theoretical results that highlight the benefits of this implicit approach to sparsification. Our focus is on results that can be achieved using augmented sparsifiers for cliques, though many of the same benefits could also be achieved with standard sparsification techniques. Let C be a set of nonempty co-occurrence groups on a set of n nodes, V , and let G = (V, E) be the corresponding co-occurrence graph on C. For c ∈ C, let k c = |c| be the number of nodes in c. For v ∈ V , let d v be the co-occurrence degree of v: the number of sets c containing v. Let d avg = 1 n v∈V d v be the average co-occurrence degree. 
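As a concrete point of comparison (ours, illustrative only), the following sketch contrasts explicitly building the co-occurrence edge weights w_ij with the implicit alternative of counting the edges used when each set c is replaced by a fixed number of CB-gadgets. The parameter gadgets_per_clique stands in for the O(ε^{-1/2} log log(1/ε)) bound of Theorem 1.3 and is simply an input here, not computed.

```python
# A sketch (ours) contrasting the explicit co-occurrence graph with the
# implicit, per-clique sparsification discussed above.
from itertools import combinations
from collections import defaultdict

def explicit_cooccurrence(C, weights):
    w = defaultdict(float)                      # edge -> weight; O(sum_c |c|^2) work
    for c, wc in zip(C, weights):
        for i, j in combinations(sorted(c), 2):
            w[(i, j)] += wc
    return w

def implicit_edge_count(C, gadgets_per_clique):
    # each CB-gadget on a set c adds 2 auxiliary nodes and 2|c| + 1 directed edges
    return sum(gadgets_per_clique * (2 * len(c) + 1) for c in C)

C = [set(range(0, 60)), set(range(50, 120))]          # two large overlapping groups
print(len(explicit_cooccurrence(C, [1.0, 1.0])))      # 4140 distinct undirected edges
print(implicit_edge_count(C, gadgets_per_clique=4))   # 1048 directed edges
```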
Let C be a set of nonempty co-occurrence groups on a set of n nodes V, and let G = (V, E) be the corresponding co-occurrence graph on C. For c ∈ C, let k_c = |c| be the number of nodes in c. For v ∈ V, let d_v be the co-occurrence degree of v: the number of sets c containing v. Let d_avg = (1/n) Σ_{v∈V} d_v be the average co-occurrence degree.

We re-state and prove Theorem 1.4, first presented in the introduction. The proof holds independent of the weight w_c we associate with each c ∈ C, since we can always scale our graph reduction techniques by an arbitrary positive weight.

Theorem. Let ε > 0 and f(ε) = ε^{−1/2} log log (1/ε). There exists an augmented sparsifier for G with O(n + |C|·f(ε)) nodes and O(n·d_avg·f(ε)) edges. In particular, if d_avg is constant and for some δ > 0 we have Σ_{c∈C} |c|² = Ω(n^{1+δ}), then forming G explicitly takes Ω(n^{1+δ}) time, but an augmented sparsifier for G with O(nf(ε)) nodes and O(nf(ε)) edges can be constructed in O(nf(ε)) time.

Proof. The set c induces a clique in the co-occurrence graph with O(k_c²) edges. Therefore, the runtime for explicitly forming G = (V, E) by expanding cliques and placing all edges equals O(Σ_{c∈C} k_c²), which is Ω(n^{1+δ}) by assumption. By Theorem 1.3, for each c ∈ C we can produce an augmented sparsifier with O(k_c·f(ε)) directed edges and O(f(ε)) new auxiliary nodes. Sparsifying each clique in this way will produce an augmented sparsifier Ĝ = (V̂, Ê) with |V̂| = n + O(|C|·f(ε)) and |Ê| = O(f(ε)·Σ_{c∈C} k_c). Observe that n·d_avg = Σ_{v∈V} d_v = Σ_{c∈C} k_c. If d_avg is a constant, this implies that Σ_{c∈C} k_c = O(n), and furthermore that |C| = O(n), since each k_c ≥ 1. Therefore |Ê| and |V̂| are both O(nf(ε)). Only O(f(ε)) edge weights need to be computed for each clique, so the overall runtime is just the time it takes to explicitly place the O(nf(ε)) edges.

The above theorem and its proof include the case where |C| = o(n), meaning that C is made up of a sublinear number of large co-occurrence interactions. In this case, our augmented sparsifier will have fewer than O(nf(ε)) nodes. When |C| = ω(n), the average degree will no longer be a constant and therefore it becomes theoretically beneficial to sparsify each clique in C using standard undirected sparsifiers. For each c ∈ C, standard cut sparsification techniques will produce an ε-cut sparsifier of c with O(k_c ε^{−2}) undirected edges and exactly k_c nodes. If two nodes appear in multiple co-occurrence relationships, the resulting edges can be collapsed into a weighted edge between the nodes, meaning that the number of edges in the resulting sparsifier does not depend on d_avg. We discuss tradeoffs between different sparsification techniques in depth in a later subsection. Regardless of the sparsification technique we apply in practice, implicitly sparsifying a co-occurrence graph will often lead to a significant decrease in runtime compared to forming the entire graph prior to sparsifying it.
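The gap between the two construction costs in the theorem can be sketched with a back-of-envelope computation such as the following; the helper f below drops all constant factors and assumes 0 < ε < 1, and the input sizes are purely illustrative.

    import math

    def f(eps):                                   # gadgets per clique, up to constants
        return max(1, round(eps ** -0.5 * math.log(math.log(1 / eps))))

    def size_comparison(group_sizes, eps):
        explicit_edges = sum(k * k for k in group_sizes)         # dense clique expansion
        sparsifier_edges = sum(k * f(eps) for k in group_sizes)   # O(k_c f(eps)) per group
        aux_nodes = len(group_sizes) * f(eps)                     # O(f(eps)) per group
        return explicit_edges, sparsifier_edges, aux_nodes

    print(size_comparison([1000] * 50, eps=0.1))   # ~5e7 edges explicitly vs ~1.5e5 implicitly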
We now consider a simple model for co-occurrence graphs with a power-law group size distribution that produces graphs satisfying the conditions of Theorem 1.4 in a range of different parameter settings. Such distributions have been observed for many types of co-occurrence graphs constructed from real-world data [13, 21]. More formally, let V be a set of n nodes, and assume a co-occurrence set c is randomly generated by sampling a size K from a discrete power-law distribution where for k ∈ [1, n]

P(K = k) = C·k^{−γ}.

Here, C is a normalizing constant for the distribution, and γ is a parameter of the model. Once K is drawn from this model, a co-occurrence set c is generated with a set of K nodes from V chosen uniformly at random. This procedure can be repeated an arbitrary number of times (drawing all sizes K independently) to produce a set of co-occurrence sets C. This C can then be used to generate a co-occurrence graph G = (V, E). (The end result of this procedure is a type of random intersection graph [14].)

We first consider a parameter regime where set sizes are constant on average but large enough to produce a dense co-occurrence graph that is inefficient to explicitly form in practice. The regime has an exponent γ ∈ (2, 3), which is common in real-world data [21].

Theorem 5.1. Let C be a set of O(n) co-occurrence sets obtained from the power-law model with γ ∈ (2, 3). The expected degree of each node will be constant and E[Σ_{c∈C} |c|²] = O(n^{4−γ}).

Proof. Let K be the size of a randomly generated co-occurrence set. We compute E[K²] = Σ_{k=1}^{n} k²·C·k^{−γ} = C·Σ_{k=1}^{n} k^{2−γ} = O(n^{3−γ}). Therefore, since C contains O(n) independently generated sets, E[Σ_{c∈C} |c|²] = O(n)·O(n^{3−γ}) = O(n^{4−γ}). For a node v ∈ V and a randomly generated set c, the probability that v will be selected to be in c is E[K]/n = O(1/n), since E[K] = C·Σ_{k=1}^{n} k^{1−γ} = O(1) when γ > 2. Since there are O(n) co-occurrence sets in C and they are each generated independently, in expectation v will have a constant degree.

We similarly consider another regime of co-occurrence graphs where the number of co-occurrence sets is asymptotically smaller than n, but the co-occurrence sets are larger on average.

Theorem 5.2. Let C be a set of O(n^β) co-occurrence sets, where β ∈ (0, 1), obtained from the power-law co-occurrence model with γ = 1 + β. Then the expected degree of each node will be a constant and E[Σ_{c∈C} |c|²] = O(n³).

Proof. Again let K be a random variable representing the co-occurrence set size. We have E[K] = C·Σ_{k=1}^{n} k^{1−γ} = C·Σ_{k=1}^{n} k^{−β} = O(n^{1−β}). For a node v ∈ V and a randomly generated set c, the probability that v will be in c is E[K]/n = O(n^{−β}). Since there are O(n^β) co-occurrence sets in C, the expected degree of v is a constant.

In Theorem 5.2, the exponent of the power-law distribution is assumed to be directly related to the number of co-occurrence sets in C. This assumption is included simply to ensure that we are in fact considering co-occurrence graphs with O(n) nodes. We could alternatively consider a power-law distribution with exponent γ ∈ (1, 2) and generate O(n^β) co-occurrence sets for any β < γ − 1. We simply note that in this regime, the expected average degree will be o(1). Assuming we exclude isolated nodes, this will produce a co-occurrence graph with o(n) nodes in expectation. Our techniques still apply in this setting, and we can produce augmented sparsifiers with O(|C|·f(ε)) nodes and O(n·d_avg·f(ε)) = o(n·f(ε)) edges. When |C| = Ω(n), then d_avg = Ω(1) and the number of edges in our augmented sparsifiers will have a worse than linear dependence on n. However, in this regime we can still quickly obtain sparsifiers with O(nε^{−2}) edges via implicit sparsification using standard undirected sparsifiers. More sophisticated models for generating co-occurrence graphs can also be derived from existing models for projections of bipartite graphs [13-15]. These make it possible to set different distributions for node degrees in V and highlight other classes of co-occurrence graphs satisfying the assumptions of Theorem 1.4. Here we have chosen to focus on the simplest model for illustrating classes of power-law co-occurrence graphs that satisfy the assumptions of the theorem.
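A minimal sampler for this model (names and parameter choices are ours) looks as follows; with γ = 2.5 it produces sets whose average size, and hence the average co-occurrence degree, stays constant while occasional very large sets appear.

    import random

    def sample_cooccurrence_sets(n, num_sets, gamma, rng=random.Random(0)):
        """Draw co-occurrence sets: each size K follows P(K = k) ∝ k^(-gamma) on [1, n],
        then K nodes are chosen uniformly at random without replacement."""
        weights = [k ** (-gamma) for k in range(1, n + 1)]
        sizes = rng.choices(range(1, n + 1), weights=weights, k=num_sets)
        return [set(rng.sample(range(n), K)) for K in sizes]

    C = sample_cooccurrence_sets(n=10_000, num_sets=10_000, gamma=2.5)
    avg_degree = sum(len(c) for c in C) / 10_000   # stays O(1) for gamma > 2
    print(avg_degree, max(len(c) for c in C))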
There are several tradeoffs to consider when using different black-box sparsifiers for implicit co-occurrence sparsification. Standard sparsification techniques involve no auxiliary nodes and have undirected edges, which is beneficial in numerous applications. Also, the number of edges they require is independent of d_avg. Therefore, in cases where the average co-occurrence degree is larger than a constant, we obtain better theoretical improvements using standard sparsifiers. On the other hand, in many settings it is natural to assume the number of co-occurrences each node belongs to is a constant, even if some co-occurrences are very large. In these regimes, our augmented sparsifiers will have fewer edges than traditional sparsifiers due to a better dependence on ε. Our techniques are also deterministic, and our sparsifiers are very easy to construct in practice. Edge weights for our sparsifiers are easy to determine in O(f(ε)) time for each co-occurrence group using Algorithm 1 (or Algorithm 3) coupled with Lemma 3.2. The bottleneck in our construction is simply visiting each node in a set c to place edges between it and the auxiliary nodes. Even in cases where there are no asymptotic reductions in theoretical runtime, our techniques provide a simple and highly practical tool for solving cut problems on co-occurrence data.

Appendix B shows how our sparse reduction techniques can be adjusted to apply even when splitting functions are asymmetric and are not required to satisfy w_e(∅) = w_e(e) = 0 (the non-cut ignoring property). Section 3 addresses the special case of symmetric and non-cut ignoring functions, as these assumptions are more natural for hypergraph cut problems [51, 53, 75] and provide the clearest exposition of our main techniques and results. Furthermore, applying the generalized asymmetric reduction strategy in Appendix B to a symmetric splitting function would introduce twice as many edges as applying the reduction from Section 3 designed explicitly for the symmetric case. Nevertheless, the same asymptotic upper bound of O((1/ε) log k) edges holds for approximately modeling the more general splitting function on a k-node hyperedge. By dropping the symmetry and non-cut ignoring assumptions, our techniques lead to the first approximation algorithms for the more general problem of minimizing cardinality-based decomposable submodular functions. Any submodular function can be minimized in polynomial time [34, 35, 63], but the runtimes for general submodular functions are impractical in most cases. A number of recent papers have developed faster algorithms for minimizing submodular functions that are sums of simpler submodular functions [22, 23, 36, 38, 43, 52, 62, 72]. This is also known as decomposable submodular function minimization (DSFM). Many energy minimization problems from computer vision correspond to DSFM problems [26, 41, 42]. Let f : 2^V → R_+ be a submodular function such that for S ⊆ V,

f(S) = Σ_{e∈E} f_e(S),

where for each e ∈ E, f_e is a simpler submodular function with support only on a subset e ⊆ V. We can assume without loss of generality that every f_e is a non-negative function. The goal of DSFM is to find arg min_{S⊆V} f(S). The terminology used for problems of this form differs depending on the context. We will continue to refer to E as a hyperedge set, V as a node set, f_e as generalized splitting functions, and f as some type of generalized hypergraph cut function. Cardinality-based decomposable submodular function minimization (Card-DSFM) has received significant special attention in previous work [36, 38, 41, 43, 72]. This variant of the problem assumes that each function f_e satisfies f_e(S) = g_e(|S|) for some concave function g_e. Unlike most existing work on generalized hypergraph cut functions [51, 53, 75], research on DSFM does not typically assume that the functions f_e are symmetric, nor does it assume that f_e(∅) = f_e(e) = 0.
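For concreteness, a cardinality-based decomposable objective of this form can be evaluated with a few lines of code; the representation below (a list of (support, concave function) pairs) is our own illustrative choice, not a prescribed data structure.

    def card_dsfm_objective(S, components):
        """Evaluate f(S) = sum_e f_e(S) for cardinality-based components,
        where each component is (support_set, g) and f_e(S) = g(|S ∩ e|)."""
        S = set(S)
        return sum(g(len(S & e)) for e, g in components)

    # Toy instance: one clique-style potential and one delta-linear potential.
    components = [
        (frozenset(range(10)), lambda i: i * (10 - i)),           # |A| * |e \ A|
        (frozenset(range(5, 20)), lambda i: min(i, 15 - i, 3)),    # min{|A|, |e\A|, delta}
    ]
    print(card_dsfm_objective(range(7), components))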
In the appendix we demonstrate that our sparse reduction techniques can also be adapted to this more general setting. This leads to a proof of Theorem 1.5. The resulting runtime will then be Õ(Σ_{e∈E} |e|² + (Σ_{e∈E} |e|)^{3/2}). We refer to our approximation algorithm for Card-DSFM as SparseCard, since it depends on sparse reduction techniques. In the remainder of the section, we provide a careful runtime comparison between SparseCard and competing runtimes for Card-DSFM. We focus on each runtime's dependence on n = |V|, R = |E|, and support sizes |e|, and use Õ notation to hide logarithmic factors of n, R, and 1/ε. To easily compare weakly polynomial runtimes, we assume that each f_e has integer outputs, and assume that log(max_S f(S)) is small enough that it can also be absorbed by the Õ notation. Our primary goal is to highlight the runtime improvements that are possible when an approximate solution suffices. Among algorithms for DSFM, SparseCard is unique in its ability to quickly find solutions with a priori multiplicative approximation guarantees. Previous approaches for DSFM focus on either obtaining exact solutions, or finding a solution to within some additive approximation error [6, 22, 36, 52]. In the latter case, setting this error small enough will guarantee an optimal solution in the case of integer output functions. However, these results provide no a priori multiplicative approximation guarantee, which is the traditional focus of approximation algorithms. Furthermore, using a larger value of the additive error only improves runtimes in logarithmic terms. In contrast, setting ε > 0 will often lead to substantial runtime decreases for SparseCard.

Competing runtime guarantees. Table 1 lists runtimes for existing methods for DSFM. We have listed the asymptotic runtime for SparseCard when applying the recent maximum flow algorithm of van den Brand et al. [73]. While this leads to the best theoretical guarantees for our method, asymptotic runtime improvements over competing methods can also be shown using alternative fast algorithms for maximum flow [29, 30, 47]. For the submodular flow algorithm of Kolmogorov [43], we have reported the runtime guarantee provided specifically for Card-DSFM. While other approaches have frequently been applied to Card-DSFM [22, 36, 52, 72], runtime guarantees for this case have not been presented explicitly and are more challenging to exactly pinpoint. Runtimes for most algorithms depend on certain oracles for solving smaller minimization problems associated with the functions f_e in an inner loop. For e ∈ E, let O_e be a quadratic minimization oracle, which for an arbitrary vector w solves min_{y∈B(f_e)} ‖y + w‖², where B(f_e) is the base polytope of the submodular function f_e (see [7, 22, 36] for details). Let θ_e be the time it takes to evaluate the oracle at e ∈ E, and define θ_max = max_{e∈E} θ_e and θ_avg = (1/R) Σ_{e∈E} θ_e. Although these oracles admit faster implementations in the case of concave cardinality functions, it is not immediately clear from previous work what the best possible runtime is. When w = 0, solving min_{y∈B(f_e)} ‖y + w‖² takes O(|e| log |e|) time [36], so this serves as a best-case runtime we can expect for the more general oracle O_e based on previous results. We note also that in the case of the function f_e(A) = |A||e\A|, Ene et al. [22] highlight that an O(|e| log |e| + |e|τ_e) algorithm can be used, where τ_e denotes the time it takes to evaluate f_e(S ∩ e) for any S ⊆ e.
In our runtime comparisons we will use the bound θ_e = Ω(|e|), as it is reasonable to expect that any meaningful submodular function we consider should take at least linear time to minimize.

Fast approximate solutions (ε > 0). Barring the regime where support sizes |e| are all very small, the accelerated coordinate descent method (ACDM) of Ene et al. [23] provides the fastest previous runtime guarantee. For a simple parameterized runtime analysis, consider a DSFM problem where the average support size is (1/R) Σ_e |e| = Θ(n^α) for α ∈ [0, 1], and R = Θ(n^β), where β ≥ 1 − α must hold if we assume each v ∈ V is in the support of at least one function f_e. An exact runtime comparison between SparseCard and ACDM depends on the best runtime for the oracle O_e for concave cardinality functions. If an O(|e| log |e|) oracle is possible, the overall runtime guarantee for ACDM would be Ω(n^{1+α+β}). Meanwhile, for a small constant ε > 0, SparseCard provides a (1 + ε)-approximate solution in time Õ(n^{α+β} + max{n^{3/2}, n^{3β/2}}), which will be faster by at least a factor of Õ(√n) whenever β ≤ 1. When β > 1, finding an approximation with SparseCard is guaranteed to be faster whenever R = o(n^{2+2α}). If the best-case oracle O_e for concave cardinality functions is ω(|e| log |e|), the runtime improvement of our method is even more significant.

Guarantees for exact solutions (ε = 0). As an added bonus, running SparseCard with ε = 0 leads to the fastest runtime for finding exact solutions in many regimes. In this case, we can guarantee SparseCard will be faster than ACDM when the average support size is Θ(n^α) and R = o(n^{2−α}). SparseCard can also find exact solutions faster than other discrete optimization methods [6, 24, 43] in wide parameter regimes. Our method has a faster runtime guarantee than the submodular flow algorithm of Kolmogorov [43] in all parameter regimes, and has a better runtime guarantee than the strongly-polynomial IBFS algorithm [24] when Σ_e |e| = o(n⁴), which includes all cases where R = o(n³). Both variants of IBFS, as well as the recent method of Axiotis et al. [6], become impractical if even a single function f_e has support size |e| = Θ(n) (even when the average support size is much smaller). In this case, runtimes for these methods are Ω(n⁵). Even using the very loose bounds Σ_e |e|² ≤ Rn² and Σ_e |e| ≤ nR, our method is guaranteed to find exact solutions faster as long as R = o(n^{7/3}), with significant additional improvements when approximate solutions suffice. In the extreme case where max_e |e| = O(1), the algorithm of Axiotis et al. [6] obtains the best theoretical guarantees, as it has the same asymptotic runtime as a single maximum flow computation on an n-node, R-edge graph. It is worth noting that in this regime, running SparseCard with ε = 0 can exactly solve Card-DSFM with the same asymptotic runtime guarantee as long as R = Θ(n), and has the added practical advantage that it requires only one call to a maximum flow oracle. The runtime guarantee for SparseCard when ε = 0 can be matched asymptotically by combining existing exact reduction techniques [41, 72, 75] with fast maximum flow algorithms. However, our method has the practical advantage of finding the sparsest exact reduction in terms of CB-gadgets. For example, this results in a reduced graph with roughly half the number of edges required when applying our previous exact reduction techniques [75].
Analogously, while Stobbe and Krause [72] showed that a concave cardinality function can be decomposed as a sum of modular functions plus a combination of |e| − 1 threshold potentials, our approximation technique will find a linear combination with |e|/2 threshold potentials. This amounts to the observation that any k + 1 points (i, g(i)) can be joined by k/2 + 1 lines instead of using k. Overall though, the most significant advantage of SparseCard over existing reduction methods is its ability to find fast approximate solutions with sparse approximate reductions.

In addition to its strong theoretical guarantees, SparseCard is very practical and leads to substantial improvements in benchmark image segmentation problems and hypergraph clustering tasks. We focus on image segmentation and localized hypergraph clustering tasks that simultaneously include component functions of large and small support, which are common in practice [22, 54, 64, 67, 75]. Image segmentation experiments were run on a laptop with a 2.2 GHz Intel Core i7 processor and 8GB of RAM. For our local hypergraph clustering experiments, in order to run a larger number of experiments we used a machine with 4 x 18-core, 3.10 GHz Intel Xeon Gold processors and 1.5 TB of RAM. We consider public datasets previously made available for academic research, and use existing open source software for competing methods.

SparseCard provides faster approximate solutions for standard image segmentation tasks previously used as benchmarks for DSFM [22, 36, 52]. We consider the smallplant and octopus segmentation tasks from Jegelka et al. [36, 37]. These amount to minimizing a decomposable submodular function on a ground set of size |V| = 427 · 640 = 273280, where each v ∈ V is a pixel from a 427 × 640 pixel image and there are three types of component functions. The first type are unary potentials for each pixel/node, i.e., functions of support size 1 representing each node's bias to be in the output set. The second type are pairwise potentials from a 4-neighbor grid graph; pixels i and j share an edge if they are directly adjacent vertically or horizontally. The third type are region potentials of the form f_e(A) = |A||e\A| for A ⊆ e, where e represents a superpixel region. The problem can be solved via maximum flow even without sophisticated reduction techniques for cardinality functions, as a region potential function on e can be modeled by placing a clique of edges on e. We compute an optimal solution using this reduction.
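The following sketch shows how the three component types just described could be assembled into such a decomposable objective (compatible with the evaluation sketch given earlier); the weights and helper names are illustrative placeholders rather than the benchmark's actual construction.

    def build_segmentation_dsfm(height, width, superpixels, unary):
        """Illustrative assembly of the three component types: unary potentials,
        4-neighbor pairwise potentials, and region potentials f_e(A) = |A| * |e \\ A|."""
        components = []
        # Unary potentials: support size 1, bias of each pixel toward the output set.
        for v, bias in unary.items():
            components.append((frozenset([v]), lambda i, b=bias: b * i))
        # Pairwise potentials on the 4-neighbor grid: all-or-nothing on 2-node edges.
        for r in range(height):
            for c in range(width):
                v = r * width + c
                if c + 1 < width:
                    components.append((frozenset([v, v + 1]), lambda i: 1 if i == 1 else 0))
                if r + 1 < height:
                    components.append((frozenset([v, v + width]), lambda i: 1 if i == 1 else 0))
        # Region potentials: clique-style concave cardinality function per superpixel.
        for region in superpixels:
            e = frozenset(region)
            components.append((e, lambda i, k=len(e): i * (k - i)))
        return components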
[Table 2: Results from SparseCard for different ε > 0 on the smallplant instance with 500 superpixels. Sparsity is the fraction of edges in the approximate graph reduction compared with the exact reduction. Finding the exact solution on the dense exact reduced graph took ≈20 minutes.]

[Figure 5: We display the average of 5 runs for competing methods, with the lighter colored region showing upper and lower bounds from these runs. SparseCard is deterministic and was run once for each ε on a decreasing logarithmic scale. Our method maintains an advantage even against post-hoc best-case parameters for competing approaches: ACDM-best is the best result obtained by running ACDM for a range of empirical parameters c for each dataset and reporting the best result. The default is c = 10 (blue curve). Best post-hoc results for the plots, from left to right, were c = 25, 10, 50, 25. It is unclear how to determine the best c in advance.]

Comparison against exact reductions. Compared with the exact reduction method, running SparseCard with ε > 0 leads to much sparser graphs, much faster runtimes, and a posteriori approximation factors that are significantly better than (1 + ε). For example, on the smallplant dataset, when ε = 1.0, the returned solution is within a factor 1.004 of optimality, even though the a priori guarantee is a 2-approximation. The sparse reduced graph has only 0.013 times the number of edges in the exact reduced graph, and solving the sparse flow problem is over two orders of magnitude faster than solving the problem on the exact reduced graph (which takes roughly 20 minutes on a laptop). In Table 2 we list the sparsity, runtime, and a posteriori guarantee obtained for a range of ε values on the smallplant dataset using the superpixel segmentation with 500 regions.

Comparison against continuous optimization algorithms. We also compare against recent C++ implementations of ACDM, RCDM, and Incidence Relation AP (an improved version of the standard AP method [62]) provided by Li and Milenkovic [52]. These use the divide-and-conquer method of Jegelka et al. [36], implemented specifically for concave cardinality functions, to solve the quadratic minimization oracle O_e for region potential functions. Although these continuous optimization methods come with no a priori approximation guarantees, we can compare them against SparseCard by computing a posteriori approximations obtained using intermediate solutions returned after every few hundred iterations. In more detail, we extract the best level set S_λ = {i : x_i > λ} from the vector of dual variables x = (x_i) at various steps, and compute the approximation ratio min_λ f(S_λ)/f(S*), where S* is the optimal solution determined via the exact max-flow solution. Figure 5 displays approximation ratio versus runtime for four DSFM instances (two datasets × two superpixel segmentations). SparseCard was run for a range of ε values on a decreasing logarithmic scale from 1 to 10^{−4}, and obtains significantly better results on all but the octopus instance with 500 superpixels. This is the easiest instance; all methods obtain a solution within a factor 1.001 of optimality within a few seconds. ACDM depends on a hyperparameter c controlling the number of iterations in an outer loop. Even when we choose the best post-hoc c value for each dataset, SparseCard maintains its overall advantage. Appendix C provides additional details regarding the competing algorithms and their parameter settings. We focus on comparisons with continuous optimization methods rather than other discrete optimization methods, as the former are better equipped for our goal of finding approximate solutions to DSFM problems involving functions of large support. To our knowledge, no implementations of the methods of Kolmogorov [43] or Axiotis et al. [6] exist. Meanwhile, IBFS [24] is designed for finding exact solutions when all support sizes are small. Recent empirical results [22] confirm that this method is not designed to handle the large region potential functions we consider here.
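The a posteriori evaluation protocol described above can be sketched as follows; the function below sweeps thresholds over a fractional or dual vector, takes the best level set, and reports the ratio against the exact optimum (assumed to be available from the max-flow solution). Names are ours.

    def best_level_set_ratio(x, f, f_opt):
        """Sweep thresholds over x, take the best level set S_lambda = {i : x_i > lambda},
        and report min_lambda f(S_lambda) / f_opt as the a posteriori approximation ratio."""
        thresholds = sorted(set(x)) + [min(x) - 1.0]   # last threshold yields the full set
        best = float("inf")
        for lam in thresholds:
            S = {i for i, xi in enumerate(x) if xi > lam}
            best = min(best, f(S))
        return best / f_opt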
Graph reduction techniques have been frequently and successfully used as subroutines for hypergraph local clustering and semi-supervised learning methods [51, 54, 76, 79]. Replacing exact reductions with our approximate reductions can lead to significant runtime improvements without sacrificing accuracy, and opens the door to running local clustering algorithms on problems where exact graph reduction would be infeasible. We illustrate this by using SparseCard as a subroutine for a method we previously designed called HyperLocal [76]. This algorithm finds local clusters in a hypergraph by repeatedly solving hypergraph minimum s-t cut problems, which could also be viewed as instances of Card-DSFM. HyperLocal was originally designed to handle only the δ-linear penalty f_e(A) = min{|A|, |e\A|, δ}, for a parameter δ ≥ 1, which can already be modeled sparsely with a single CB-gadget. SparseCard makes it possible to sparsely model any concave cardinality penalty. We specifically use approximate reductions for the weighted clique penalty f_e(A) = (|e| − 1)^{−1}|A||e\A|, the square root penalty f_e(A) = √(min{|A|, |e\A|}), and the sublinear power function penalty f_e(A) = (min{|A|, |e\A|})^{0.9}, all of which require O(|e|²) edges to model exactly using previous reduction techniques. Weighted clique penalties in particular have been used extensively in hypergraph clustering [1, 44, 79, 83], including by methods specifically designed for local clustering and semi-supervised learning [48, 79, 81].

Stackoverflow hypergraph. We consider a hypergraph clustering problem where nodes are 15.2M questions on stackoverflow.com and each of the 1.1M hyperedges defines a set of questions answered by the same user. The mean hyperedge size is 23.7, the maximum size is over 60k, and there are 2165 hyperedges with at least 1000 nodes. Questions with the same topic tag (e.g., "common-lisp") constitute small labeled clusters in the dataset. We previously showed that HyperLocal can detect clusters quickly with the δ-linear penalty by solving localized s-t cut problems near a seed set. Applying exact graph reductions for other concave cut penalties is infeasible, due to the extremely large hyperedge sizes, and we found that using a clique expansion after simply removing large hyperedges performed poorly [76]. Using SparseCard as a subroutine opens up new possibilities.

Experimental setup and parameter settings. We follow our previous experimental setup [76] for detecting localized clusters in the Stackoverflow hypergraph. We focus on a set of 45 local clusters, all of which are question topics involving between 2,000 and 10,000 nodes.

[Figure 6: Average F1 score obtained for each localized cluster using 4 different hyperedge cut penalties. For the clique, x^0.9, and square root penalties, we used an approximate reduction with ε = 1.0. The clique penalty had the highest average F1 score on 26 clusters, the δ-linear had the highest on 10 clusters, and x^0.9 had the highest average score on the remaining 9 clusters.]

For each cluster, we generate a random seed set by selecting 5% of the nodes in the target cluster uniformly at random, and then add neighboring nodes to the seed set to grow it into a larger input set Z to use for HyperLocal (see [76] for details). We set δ = 5000 for the δ-linear hyperedge cut function and set a locality parameter for HyperLocal to 1.0 for all experiments. With this setup, using HyperLocal with the δ-linear penalty will then reproduce our original experimental setup [76]. Our goal is to show how using SparseCard leads to fast and often improved results for alternative penalties that could not previously be used.
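For reference, the four hyperedge cut penalties compared in these experiments are all concave cardinality-based splitting functions and can be written as functions of i = |A|, as in the sketch below; the helper name is ours, and we assume the square root penalty is √(min{|A|, |e\A|}), as its name suggests.

    import math

    def hyperedge_penalties(k, delta=5000):
        """The four concave cardinality-based penalties compared in the experiments,
        written as functions of i = |A| for a hyperedge of size k (names are ours)."""
        return {
            "delta-linear": lambda i: min(i, k - i, delta),
            "clique":       lambda i: i * (k - i) / (k - 1),
            "sqrt":         lambda i: math.sqrt(min(i, k - i)),
            "x^0.9":        lambda i: min(i, k - i) ** 0.9,
        }

    g = hyperedge_penalties(k=1000)
    print({name: round(fn(100), 2) for name, fn in g.items()})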
Experimental results. In Figure 6 we show the mean F1 score for each cluster (across the ten random seed sets) obtained by the δ-linear penalty and the three alternative penalties when ε = 1.0. The clique penalty obtained the highest average F1 score on 26 clusters, the δ-linear obtained the highest average score on 10 clusters, and the sublinear penalty obtained the best average score on the remaining 9 clusters. Importantly, cut penalties that previously could not be used on this dataset (clique, x^0.9) obtain the best results for most clusters. The square root penalty does not perform particularly well on this dataset, but it is instructive to consider its runtime (Figure 7f). Theorem 4.3 shows that asymptotically this function has worst-case behavior in terms of the number of CB-gadgets needed to approximate it. We nevertheless obtain reasonably fast results for this penalty function, indicating that our techniques can provide effective sparse reductions for any concave cardinality function of interest. We also ran experiments with ε = 0.1 and ε = 0.01, which led to noticeable increases in runtime but only very slight changes in F1 scores (Figure 7). This indicates that, even in settings where exact reductions are not possible, our sparse approximate reductions serve as fast and very good proxies for exact reductions.

[Figure 7: We display differences in runtimes and F1 scores when using different ε values to approximate three hyperedge cut penalties. Using a larger ε provides significant runtime savings with virtually no effect on F1 scores. Clusters have been re-arranged on the horizontal axis for each hyperedge cut penalty for easier visualization.]

We have introduced the notion of an augmented cut sparsifier, which approximates a generalized hypergraph cut function with a sparse directed graph on an augmented node set. Our approach relies on a connection we highlight between graph reduction strategies and piecewise linear approximations to concave functions. Our framework leads to more efficient techniques for approximating hypergraph s-t cut problems via graph reduction, improved sparsifiers for co-occurrence graphs, and fast algorithms for approximately minimizing cardinality-based decomposable submodular functions. As noted in Section 1.2, an interesting open question is to establish and study analogous notions of augmented spectral sparsification, given that spectral sparsifiers provide a useful generalization of cut sparsifiers in graphs [69]. One way to define such a notion is to apply existing definitions of submodular hypergraph Laplacians [53, 80] to both the original hypergraph and its sparsifier. This requires viewing our augmented sparsifier as a hypergraph with splitting functions of the form w_e(A) = a · min{|A|, |e\A|, b}, corresponding to hyperedges with cut properties that can be modeled by a cardinality-based gadget. From this perspective, augmented spectral sparsification means approximating a generalized hypergraph cut function with another hypergraph cut function involving simplified splitting functions. While this provides one possible definition for augmented spectral sparsification, it is not clear whether the techniques we have developed can be used to satisfy this definition. Furthermore, it is not clear whether obtaining such a sparsifier would imply any immediate runtime benefits for approximating the spectra of generalized hypergraph Laplacians, or for solving generalized Laplacian systems [27, 50].
We leave these as questions for future work. While our work provides the optimal reduction strategy in terms of cardinality-based gadgets, this is more restrictive than optimizing over all possible gadgets for approximately modeling hyperedge cut penalties. Optimizing over a broader space of gadgets poses another interesting direction for future work, but is more challenging in several ways. First of all, it is unclear how to even define an optimal reduction when optimizing over arbitrary gadgets, since it is preferable to avoid both adding new nodes and adding new edges, but the tradeoff between these two goals is not clear. Another challenge is that the best reduction may depend heavily on the splitting function we wish to reduce, which makes developing a general approach difficult. A natural next step would be to at least better understand lower bounds on the number of edges and auxiliary nodes needed to model different cardinality-based splitting functions. While we do not have any concrete results, there are several indications that cardinality-based gadgets may be nearly optimal in many settings. For example, star expansions and clique expansions provide a more efficient way to model linear and quadratic splitting functions respectively, but modeling these functions with cardinality-based gadgets only increases the number of edges by roughly a factor of two. Finally, we find it interesting that using auxiliary nodes and directed edges makes it possible to sparsify the complete graph using only O(nε^{−1/2} log log (1/ε)) edges, whereas standard sparsifiers require O(nε^{−2}). We would like to better understand whether both directed edges and auxiliary nodes are necessary for making this possible, or whether improved approximations are possible using only one or the other.

Proof of Lemma 3.1.

Lemma. Let f̂ be the continuous extension of a function w ∈ S_r, shown in (21). This function is in the class F_r, and has exactly J positive-sloped linear pieces and one linear piece of slope zero.

Proof. Define b_0 = 0 for notational convenience. The first three conditions in Definition 3.1 can be seen by inspection, recalling that 0 < a_j and 0 < b_j ≤ r for all j ∈ [J]. Observe that f̂ is linear over the interval [b_{i−1}, b_i] for each i ∈ [J], and is constant for x ≥ b_J. In other words, the ith linear piece of f̂, defined over [b_{i−1}, b_i], can be written as f̂(x) = c_i + m_i·x, where the intercept and slope terms are given by c_i = Σ_{j<i} a_j b_j and m_i = Σ_{j≥i} a_j.

We next prove the converse direction: every function in F_r is the continuous extension of an order-J CCB function.

Proof. Since f is in F_r, it has J positive-sloped linear pieces and one flat linear piece, and therefore it has exactly J breakpoints: 0 < b_1 < b_2 < ... < b_J. Let b = (b_j) be the vector storing these breakpoints. For convenience we define b_0 = 0, though b_0 is not stored in b. By definition, f is constant for all x ≥ r, which implies that b_J ≤ r. Let m_i denote the positive slope of the ith linear piece of f, which occurs in the range [b_{i−1}, b_i] and is given by m_i = (f(b_i) − f(b_{i−1}))/(b_i − b_{i−1}). The ith linear piece of f is given by x ↦ f(b_{i−1}) + m_i·(x − b_{i−1}) for x ∈ [b_{i−1}, b_i]. The last linear piece of f is a flat line over the interval x ∈ [b_J, ∞), i.e., m_{J+1} = 0. Since f has positive and strictly decreasing slopes, we can see that a_i = m_i − m_{i+1} > 0 for every i ∈ [J]. Let w be the order-J CCB function constructed from the vectors (a, b), and let f̂ be its resulting continuous extension f̂(x) = Σ_{j=1}^{J} a_j·min{x, b_j}. We must check that f̂ = f. By Lemma 3.1, we know that f̂ is in F_r and has exactly J + 1 linear pieces. The functions will be the same, therefore, if they share the same values at breakpoints. Evaluating f̂ at an arbitrary breakpoint b_i gives f̂(b_i) = Σ_{j≤i} a_j b_j + b_i·Σ_{j>i} a_j. We first confirm that the functions coincide at the first breakpoint: f̂(b_1) = b_1·Σ_{j≥1} a_j = b_1·m_1 = f(b_1). For any fixed i ∈ {2, 3, ..., J}, we similarly have f̂(b_i) − f̂(b_{i−1}) = (b_i − b_{i−1})·Σ_{j≥i} a_j = (b_i − b_{i−1})·m_i = f(b_i) − f(b_{i−1}). Therefore, f and f̂ are the same piecewise linear function.
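The construction in this proof is easy to check numerically. The sketch below assumes the continuous extension of an order-J CCB function has the form f̂(x) = Σ_j a_j · min{x, b_j} (as used above), recovers the parameters a_j = m_j − m_{j+1} from a given concave piecewise-linear function, and confirms agreement at the breakpoints; all names are ours.

    def ccb_extension(a, b):
        """Continuous extension of an order-J CCB function: x -> sum_j a_j * min(x, b_j)."""
        return lambda x: sum(aj * min(x, bj) for aj, bj in zip(a, b))

    def ccb_from_piecewise_linear(breakpoints, slopes):
        """Given breakpoints 0 < b_1 < ... < b_J and positive strictly decreasing slopes
        m_1 > ... > m_J (with m_{J+1} = 0), recover a_j = m_j - m_{j+1}."""
        m = list(slopes) + [0.0]
        a = [m[j] - m[j + 1] for j in range(len(breakpoints))]
        return a, list(breakpoints)

    # Example: f has breakpoints (1, 3, 6) and slopes (5, 2, 1) on its three pieces.
    a, b = ccb_from_piecewise_linear([1, 3, 6], [5, 2, 1])
    f_hat = ccb_extension(a, b)
    print(a, [f_hat(x) for x in (0, 1, 3, 6, 10)])   # matches f at the breakpoints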
In Sections 2 and 3 we focused on sparsification techniques for representing splitting functions that are symmetric and penalize only cut hyperedges, i.e., that satisfy w_e(S) = w_e(e\S) for all S ⊆ e, and w_e(e) = w_e(∅) = 0. These assumptions are standard for generalized hypergraph cut problems [51, 53, 75], and lead to the clearest exposition of our main results. In this appendix, we extend our sparse approximation techniques so that they apply even if we remove these restrictions. This will allow us to obtain improved techniques for approximately minimizing a certain class of decomposable submodular functions (see Section 6). Formally, our goal is to minimize

f(S) = Σ_{e∈E} w_e(S ∩ e)   over S ⊆ V,

where each w_e is a submodular cardinality-based function that is not necessarily symmetric and does not need to equal zero when the hyperedge e is uncut. Our proof strategy for reducing this more general problem to a graph s-t cut problem closely follows the same basic set of steps used in Section 3 for the special case. We first provide a convenient characterization of general cardinality-based submodular functions. By general we mean that the splitting function does not need to be symmetric, nor does it need to have a zero penalty when the hyperedge is uncut.

Lemma B.1. Let w_e be a general submodular cardinality-based splitting function on a k-node hyperedge e, and let w_i denote the penalty for any A ⊆ e with |A| = i. Then for i ∈ {1, 2, ..., k − 1},

2w_i ≥ w_{i−1} + w_{i+1}.

To simplify our analysis, as we did for the symmetric case, we will define a set of functions that are virtually identical to these splitting functions on k-node hyperedges, but are defined over the integers from 0 to k rather than on subsets of a hyperedge. Our goal is to show how to approximate k-GSCB integer functions using piecewise linear functions with few linear pieces. This in turn corresponds to approximating a hyperedge splitting function with a sparse gadget. In order for this to work for our more general class of splitting functions, we use a slight generalization of an asymmetric gadget we introduced in previous work [75].

Definition B.2. The asymmetric cardinality-based gadget (ACB-gadget) for a k-node hyperedge e is parameterized by scalars a and b and constructed as follows:
• Introduce an auxiliary vertex v_e.
• For each v ∈ e, introduce a directed edge from v to v_e with weight a·(k − b), and a directed edge from v_e to v with weight a·b.

The ACB-gadget models the following k-GSCB integer function:

w(i) = a · min{b·(k − i), i·(k − b)}.

To see why, consider where we must place the auxiliary node v_e when solving a minimum s-t cut problem involving the ACB-gadget. If we place i nodes on the s-side, then placing v_e on the s-side has a cut penalty of ab(k − i), whereas placing v_e on the t-side gives a penalty of ai(k − b). To minimize the cut, we choose the smaller of the two options. Previously we showed that asymmetric splitting functions can be modeled exactly by a combination of k − 1 ACB-gadgets [75]. As we did in Section 3 for symmetric splitting functions, we will show here that a much smaller number of gadgets suffices if we are content to approximate the cut penalties of an asymmetric splitting function. In our previous work we enforced the constraint w_e(∅) = w_e(e) = 0 even for asymmetric splitting functions, but we remove this constraint here.
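A quick way to see what a single ACB-gadget models is to compute the cheaper of the two placements of the auxiliary node v_e directly, as in the following sketch (names are ours); the assertion checks the closed form w(i) = a · min{b(k − i), i(k − b)} stated above.

    def acb_gadget_penalty(i, k, a, b):
        """Cut penalty of one ACB-gadget when i of the k nodes of e are on the s-side:
        the auxiliary node v_e is placed on whichever side is cheaper."""
        penalty_ve_on_s = a * b * (k - i)        # cuts the (k - i) edges v_e -> v, weight a*b each
        penalty_ve_on_t = a * i * (k - b)        # cuts the i edges v -> v_e, weight a*(k - b) each
        return min(penalty_ve_on_s, penalty_ve_on_t)

    # The gadget therefore models w(i) = a * min{b*(k - i), i*(k - b)}:
    k, a, b = 6, 2.0, 1.5
    assert all(acb_gadget_penalty(i, k, a, b) == a * min(b * (k - i), i * (k - b)) for i in range(k + 1))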
In order to model the cut properties of an arbitrary GSCB splitting function, we define a combined gadget involving multiple ACB-gadgets, as well as edges from each node v ∈ e to the source and sink nodes of the graph. The augmented cut function for the resulting directed graph Ĝ = (V ∪ A ∪ {s, t}, Ê) will then be given by cut_Ĝ(S) = min_{T⊆A} dircut_Ĝ({s} ∪ S ∪ T) for a set S ⊆ V, where dircut_Ĝ is the directed cut function on Ĝ. Finding a minimum s-t cut in Ĝ will solve objective (49), or equivalently, the cardinality-based decomposable submodular function minimization problem.

Definition B.3. A k-CG function (k-node combined gadget function) ŵ of order J is a k-GSCB integer function that is parameterized by scalars z_0, z_k, and (a_j, b_j) for j ∈ [J]. The function has the form

ŵ(i) = (k − i)·z_0 + i·z_k + Σ_{j=1}^{J} a_j · min{b_j·(k − i), i·(k − b_j)}   for i ∈ {0, 1, ..., k}.

The scalars parameterizing ŵ satisfy b_j > 0 and a_j > 0 for all j ∈ [J]; b_j < b_{j+1} for all j ∈ [J − 1]; b_J < k; and z_0 ≥ 0 and z_k ≥ 0. Conceptually, the function shown in (52) represents a combination of J ACB-gadgets for a hyperedge e, where additionally for each node v ∈ e we have placed a directed edge from a source node s to v of weight z_0, and an edge from v to a sink node t with weight z_k. The continuous extension of the k-CG function (52) is defined to be

f̂(x) = (k − x)·z_0 + x·z_k + Σ_{j=1}^{J} a_j · min{b_j·(k − x), x·(k − b_j)}   for x ∈ [0, k].

Lemma B.2. The continuous extension f̂ of ŵ is nonnegative over the interval [0, k], piecewise linear, concave, and has exactly J + 1 linear pieces.

Proof. Nonnegativity follows quickly from the nonnegativity of z_0 and z_k, the positivity of (a_i, b_i) for i ∈ [J], and b_J < k. For the other properties, we begin by re-writing the function as f̂(x) = kz_0 + x(z_k − z_0) + kx·Σ_{j : x ≤ b_j} a_j − …