SAT-Based Encodings for Optimal Decision Trees with Explicit Paths
Mikoláš Janota and António Morgado
Theory and Applications of Satisfiability Testing - SAT 2020, 2020-06-26. DOI: 10.1007/978-3-030-51825-7_35

Abstract. Decision trees play an important role both in Machine Learning and Knowledge Representation. They are attractive due to their immediate interpretability. In the spirit of Occam's razor, and interpretability, it is desirable to calculate the smallest tree. This, however, has proven to be a challenging task, and greedy approaches are typically used to learn trees in practice. Nevertheless, recent work showed that by the use of SAT solvers one may calculate the optimal-size tree for real-world benchmarks. This paper proposes a novel SAT-based encoding that explicitly models paths in the tree, which enables us to control the tree's depth as well as its size. At the level of individual SAT calls, we investigate splitting the search space into tree topologies. Our tool outperforms the existing implementation. Moreover, the experimental results show that minimizing the depth first and then minimizing the number of nodes enables solving a larger set of instances.

Decision trees play an important role in machine learning, either on their own [6] or in the context of ensembles [5]. Learning decision trees is especially attractive in the context of interpretable machine learning due to their simplicity. However, despite this simplicity, minimization of decision trees is well known to be an NP-hard problem [10, 16]. Yet, smaller trees are likely to generalize better. To learn trees, suboptimal, greedy algorithms are used in practice. With the rise of powerful reasoning engines, recent research has tackled the problem by the use of SAT, CSP, or MILP solvers [1, 27, 37, 38]. Indeed, the state-of-the-art technology shows that many (NP-)hard problems are often successfully solved. Conversely, such applications drive the reasoning technology by providing interesting benchmarks. This paper follows this line of research and proposes a novel SAT-based encoding. This encoding enables finding a decision tree conforming to the given set of examples with a given depth and number of nodes. A minimal tree is found by iterative calls to a SAT solver while minimizing size and depth.

Focusing not only on the size but also on the depth of the tree brings about opportunities for further analysis. Intuitively, shallower trees are less likely to over-fit. Indeed, modern packages such as Scikit [30] enable imposing a threshold on the depth, which users have to set manually. Also, a shallow tree is more likely to be interpretable by a human because less memory is required to keep track of a single branch.

The problem at hand is of challenging complexity. In practice, we may need to deal with a high number of features and examples, which brings the search-space of possible trees into extreme dimensions. Looking for an optimal tree means not only finding such a tree but also proving that no smaller tree exists. The SAT technology has recently shown a lot of promise in tackling difficult combinatorial questions, e.g. Erdős' discrepancy [22] or the Boolean Pythagorean triples problem [13], among others. Inspired by these results we also investigate the splitting of the search-space based on the topology of the decision tree.

The paper has the following main contributions.
1. It proposes a novel SAT-based encoding for decision trees, along with a number of optimizations.
2. Compared to the existing encoding, it represents paths of the tree rather than nodes. This enables natively controlling not only the tree's size but also the tree's depth.
3. It shows that minimizing depth first and then size enables tackling harder instances.
4. It shows that search-space splitting by topologies enables tackling harder instances.
5. The implemented tool outperforms existing work [27].

Standard notions and notation for propositional logic are assumed [36]. A literal is a Boolean variable (x) or its negation (denoted ¬x); a clause is a disjunction of literals; a cube is a conjunction of literals. A formula is in conjunctive normal form (CNF) if it is a conjunction of clauses. General Boolean formulas, constructed using the standard connectives conjunction (∧), disjunction (∨), implication (→), and bi-implication (↔), are also considered. State-of-the-art SAT solvers typically accept input in CNF. Non-CNF formulas are converted to CNF by standard equisatisfiable clausification methods [31]. Several constraints in the paper also rely on cardinality constraints [34]. These are also turned into CNF through standard means; the implementation makes use of the cardinality encodings available in the tool PySAT [18, 26].

The standard setting of supervised learning is assumed [35]. Following the notation and concepts of [27], we expect features to be binary (with values 0, 1). Non-binary features can be reduced to binary by unary or binary encoding. Analogously, classes are also binary (positive, negative).

The objective is to develop a propositional formula whose models are decision trees congruent with the given set of samples. Such a model is then found by a call to an off-the-shelf SAT solver. As customary, we take the approach of optimizing by solving a series of decision problems. This means finding a decision tree of a certain size and diminishing the size until no such tree exists. Alternatively, other types of search can be used, e.g., binary or progression. This paper targets two optimization criteria: size and depth. Minimizing any combination of the two may potentially be of interest. Section 5 discusses the exact type of search used in the implementation.

The structure of binary trees guarantees a number of well-known properties. Any tree with n nodes has (n + 1)/2 leaves and (n − 1)/2 internal nodes. Further, n is always odd and the number of leaves is equal to the number of paths going from the root to a leaf. Our encoding heavily exploits this property: rather than modeling the nodes of a tree, we model the set of unique paths from the root to the leaves.

The optimization algorithm has two levels. At the first level, search is carried out on the tree's size and depth. At the second level, the decision problem of finding a tree with such depth and size is solved via a SAT solver. The SAT solver is used in a black-box fashion, i.e., the problem is encoded into its propositional form and any off-the-shelf SAT solver may be used to solve it. In the remainder of this section we focus on the decision problem, which is invoked with a given number of paths P (controlling size) and a maximum allowed number of steps in a path S (controlling depth). The steps in a path are numbered in the following way. In the first step, each path is in the root. In the last step of a path, the path goes from an internal node to a leaf.
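To make the two-level search concrete, the following is a minimal sketch of the outer loop for the depth-then-size variant. The oracle find_tree(P, S) is a hypothetical stand-in for a single SAT call on the encoding described below, and the bound max_paths (in the implementation derived from a greedy sklearn solution) is an assumption of this sketch.

```python
# Minimal sketch of the two-level optimization loop (depth first, then
# size). find_tree(P, S) is a hypothetical oracle standing in for one
# SAT call: it returns a decision tree with P root-to-leaf paths of at
# most S steps each, or None if no such tree exists.

def minimize_depth_then_size(find_tree, max_paths):
    """Assumes the training data are consistent, so some tree exists."""
    depth = 1
    while True:
        # At a fixed depth bound, grow the number of paths (= leaves);
        # the first satisfiable call is the lexicographic optimum.
        for paths in range(2, min(max_paths, 2 ** depth) + 1):
            tree = find_tree(paths, depth)
            if tree is not None:
                return tree  # P paths correspond to n = 2P - 1 nodes
        depth += 1  # no tree within this depth; relax the depth bound
```

Size-only minimization (searches (1) and (2) in Sect. 5) instead searches on the number of paths alone, by binary or linear UNSAT-SAT search, with S set to its trivial bound.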
Given these parameters, if we are looking for a tree with a particular depth and a particular number of nodes, we set S and P accordingly. If we are looking only for a tree with a minimal number of nodes but arbitrary depth, the value of S is set to P − 1, which corresponds to the number of internal nodes.

The encoding we propose models each path from the root to a leaf separately, while imposing relations between the paths that guarantee that they form a binary tree. Throughout the paper, we use the convention that for a node labeled by a feature f, the left child corresponds to the value 0 of f and the right child corresponds to the value 1 of f.

To model the tree, we introduce a matrix of variables, where each row represents a path and each column represents a step in the path. The first row (the first path) is the path that only goes to the left, i.e., the leftmost path in the tree. Analogously, the last row (the last path) is the path that only goes to the right, i.e., the rightmost path in the tree. In general, the paths are ordered in the way they would be obtained by running a DFS that goes to the left first. Each path corresponds to a sequence of 0's and 1's, where 0 is a step to the left and 1 is a step to the right; we then consider these paths in lexicographic order.

Each path is represented by a sequence of variables, one for each step, where the variable represents whether the path goes left or right in that step. Additionally, for each step we need to remember whether the path has already terminated and which prefix is shared with the previous path. Table 1 summarizes the main variables of the encoding. The direction of each step s in a path p is determined by the variable g^p_s. What is somewhat unusual about this encoding is that paths may share prefixes. To that effect, the variable e^p_s represents that the path p in step s is in the same node as the preceding path p − 1. The semantics of the variables e^p_s is defined inductively. All paths share the root and therefore e^p_1 must always be true. In further steps, paths p and p − 1 remain equal as long as both paths take steps in the same direction. Since it is unknown beforehand how many steps are in either path, the variables t^p_s determine whether the path has already terminated or not. Observe that the variables t^p_s go up to step S + 1, whereas the variables g^p_s go only to step S. This is because the g^p_s variables correspond to edges in the path, while termination (as well as equality) is tracked for nodes. A terminated path remains terminated, and a path cannot terminate while it is still equal to the previous one. Any path must terminate by step S + 1 at the latest.

Example 1. Figure 1 shows a binary tree along with the values of the topology variables (g^p_s, t^p_s, and e^p_s). The tree comprises 4 leaves and therefore 4 paths. In this simple example each path makes two steps and then terminates. The second path shares everything with the first one except for the leaf. The third path only shares the root with the second path. The last path shares everything with the third path, except for the leaf. Observe that since this is a full binary tree, the g^p_s variables represent the binary numbers from 0 to 3.

Now it is necessary to ensure that the paths are lexicographically ordered. The first path always goes left and the last one always goes right. If a path p in step s is in the same node as path p − 1, the path p can go left only if p − 1 also went left (otherwise they would cross). The lexicographic order alone does not guarantee a correct topology.
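The topology part of the encoding can be written down compactly with PySAT's IDPool. The sketch below reflects our own reading of the constraints described so far (root sharing, monotone termination, no crossing, and the inductive definition of e^p_s); it is not the paper's verbatim constraint set (1)–(16), and the handling of don't-care directions on terminated steps is simplified.

```python
# Sketch of the topology constraints of the path-based encoding;
# a simplified reading, not the paper's full constraint set (1)-(16).
from pysat.formula import IDPool

def topology_clauses(P, S):
    pool, cls = IDPool(), []
    g = lambda p, s: pool.id(('g', p, s))  # step s of path p goes right
    t = lambda p, s: pool.id(('t', p, s))  # path p is terminated at step s
    e = lambda p, s: pool.id(('e', p, s))  # path p equals path p-1 at step s

    for s in range(1, S + 1):
        cls.append([-g(1, s)])             # the first path only goes left
        cls.append([g(P, s)])              # the last path only goes right
    for p in range(1, P + 1):
        for s in range(1, S + 1):
            cls.append([-t(p, s), t(p, s + 1)])  # terminated stays terminated
        cls.append([t(p, S + 1)])          # every path terminates by step S+1
    for p in range(2, P + 1):
        cls.append([e(p, 1)])              # all paths share the root
        for s in range(1, S + 2):
            cls.append([-e(p, s), -t(p, s)])     # cannot terminate while equal
        for s in range(1, S + 1):
            # lexicographic order: at a shared node, p may go left
            # only if p-1 also went left
            cls.append([-e(p, s), -g(p - 1, s), g(p, s)])
            # e(p, s+1) <-> e(p, s) /\ (g(p, s) <-> g(p-1, s))
            cls.append([-e(p, s + 1), e(p, s)])
            cls.append([-e(p, s + 1), g(p, s), -g(p - 1, s)])
            cls.append([-e(p, s + 1), -g(p, s), g(p - 1, s)])
            cls.append([-e(p, s), -g(p, s), -g(p - 1, s), e(p, s + 1)])
            cls.append([-e(p, s), g(p, s), g(p - 1, s), e(p, s + 1)])
    return cls, pool
```

These clauses only fix the shape of the tree; the feature-assignment and example-matching constraints (the variables m^p_{f,v} and m^p_q below) come on top of them.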
Since the tree is binary, any path must also adhere to the following pattern. For a certain number of steps it shares a prefix with the preceding path, until it breaks off. Once it breaks off, it may only go to the left (or terminate). At the same time, the preceding path may only go right after the break-off point (or terminate). Otherwise, there would be a gap in the tree. Figure 2 illustrates these constraints. Consider the blue path, R → A → B → E, and the red path, R → A → C → F, where the blue one is lexicographically smaller. The paths diverge in node A: the blue one goes left, the red one goes right. Afterwards, the blue path may only go right or terminate. In contrast, the red path may only go left or terminate. The reason why this has to be the case is that, for the red path to follow the blue one in our ordering, the blue path has to be the last path of the subtree rooted in B, while the red path has to be the first path of the subtree rooted in C.

Assigning Features and Their Semantics. The encoding of the semantics of the training data is similar to the one in [27] but with two major differences:
1. Here, classification is per path, while in [27] it is per node, because there any node can potentially be a leaf. This means that the semantics of the examples in our approach needs to be repeated only (n + 1)/2 times rather than n times.
2. Our encoding introduces explicit variables to track whether a given training example is matched by a given path, which is useful for one of the optimizations (see Sect. 3.2).

We make sure that each step is assigned exactly one feature and that no feature appears more than once on any path. Recall that an example is seen as a set of feature-value pairs. We say that a feature-value pair (f, v) is matched on a path if the path makes a step in the direction of v in the node that is assigned the feature f. An example is matched if all its feature-value pairs are matched. These two concepts are modeled by the variables m^p_{f,v} and m^p_q, respectively. Observe that (f, v) is also matched on any path that does not contain f at all. Finally, once a path matches any positive example, it must be classified as positive, and analogously for negative examples.

Summary of the Encoding. The constraints (1)–(16) are parameterized by natural numbers P and S, and their satisfying assignments represent a sequence of P paths in a binary tree from the root to a leaf, where each path has at most S edges. The paths are lexicographically ordered, starting from the leftmost path and ending in the rightmost one. Additionally, the encoding ensures that there are no gaps between paths and therefore the paths represent the whole binary tree. Each node in a path is labeled by a feature in such a way that shared prefixes among paths are labeled by the same features. Each path is assigned a classification class that must be congruent with the training examples given on the input. A further constraint enforces that the child of a node, taken in the direction of the value v, is a leaf classified with the class c; at the formula level, we add the corresponding constraint for s ∈ 1..S and p ∈ 1..P.

Path Lower Bounds. We propose to use MaxSAT to obtain lower bounds on the length of a path. The question we ask is what is the shortest possible path that separates positive and negative examples. Since the lower bound considers only one path at a time, the order of features on that path is irrelevant. In preliminary experiments we have observed rather small lower bounds. However, the bound can be improved for the leftmost and rightmost branches.
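The leftmost/rightmost refinement is easiest to see in code. Below is a hedged MaxSAT sketch of the path lower bound using PySAT's RC2; the dual-rail variables anticipate the x^0_f, x^1_f variables introduced in the next paragraph, but the pair-separation clauses are our own plausible reconstruction of the hard constraints, not the paper's exact formulation.

```python
# Hedged sketch of the MaxSAT lower bound on path length. A choice of
# (feature, direction) steps must exclude at least one example of every
# positive/negative pair from the path's leaf; the MaxSAT optimum (the
# number of chosen steps) lower-bounds the path length.
from pysat.formula import WCNF, IDPool
from pysat.examples.rc2 import RC2

def path_length_lower_bound(pos, neg, n_features, force_dir=None):
    """pos/neg: lists of 0/1 feature vectors. force_dir=0 (resp. 1)
    restricts the bound to the leftmost (resp. rightmost) branch."""
    pool, wcnf = IDPool(), WCNF()
    x = lambda v, f: pool.id(('x', v, f))  # feature f used with direction v
    for f in range(n_features):
        wcnf.append([-x(0, f), -x(1, f)])  # each feature appears at most once
        if force_dir is not None:          # left-/rightmost branch variant
            wcnf.append([-x(1 - force_dir, f)])
        wcnf.append([-x(0, f)], weight=1)  # soft: minimize the number of steps
        wcnf.append([-x(1, f)], weight=1)
    for p in pos:
        for q in neg:
            # at least one of p, q must be excluded from this path's leaf
            clause = []
            for f in range(n_features):
                clause.append(x(1 - p[f], f))      # a step that excludes p...
                if q[f] != p[f]:
                    clause.append(x(1 - q[f], f))  # ...or one that excludes q
            wcnf.append(clause)
    with RC2(wcnf) as rc2:
        rc2.compute()
        return rc2.cost  # minimum number of steps on any separating path
```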
As noted above, this gives us three types of bounds: one for the leftmost branch, one for the rightmost branch, and one for any branch in between. In any path, a feature either does not appear, or appears on a step that goes left, or on a step that goes right. To model this behavior we introduce two variables x^0_f and x^1_f for each feature f (similar to the dual-rail encoding [25]). This corresponds to hard and soft constraints of the kind sketched above.

In initial experiments, we observed that the SAT solver may struggle even on decision trees of modest size, e.g. 9 nodes. This is somewhat surprising because the number of topologies does not initially grow that much; see Table 2. This suggests splitting the search space into individual topologies and calling the SAT solver for each one of them separately. This way, the SAT solver only needs to find the labeling of the tree. Intuitively, this should be an easier problem because the SAT solver only needs to deal with one type of decisions.

This approach is not generally viable because eventually the number of topologies is too large. To cope with this, we propose the following approach. The upper part of the topology is fixed, up to a certain depth, and the rest is left for the SAT solver to complete. This gives rise to topology templates. Each topology template is a tree in which each leaf is either an actual leaf of the topology or an incomplete subtree. Algorithm 1 recursively enumerates incomplete topologies on n nodes with the cut-off parameter d. In order to avoid repetitions in the enumeration, certain cases need to be treated separately. If the cut-off parameter reaches 1, the children of the current node will either be leaves or incomplete subtrees. In general, this gives three scenarios: either the left or the right child is a leaf and the other child is a subtree, or both are subtrees. However, in the cases n = 3 and n = 5 the scenarios are different. Observe that, because of the cut-off parameter, the generated topology template may have fewer than n nodes.

We study topology enumeration both for our encoding and for the encoding of Narodytska et al. [27]. A given topology template in the encoding of Narodytska et al. is enforced by a cube corresponding to the child relation and to the information of whether each node is a leaf. An important property of our generation procedure is that the cut-off parameter is equal on all branches. This means that numbering the topology template by BFS gives the same numbers as a BFS on any topology corresponding to it. Since the encoding of mindt relies on BFS, this property lets us directly translate the relation into a cube.

Our path-based encoding does not easily allow encoding a topology template, because the number of paths in an incomplete subtree is unknown. To this effect, we introduce a variation of the topology template where the leaves of the template are either actual leaves or incomplete subtrees with a given cardinality (#k). These topology templates can be easily enumerated, as shown by Algorithm 2. Observe that the number of these topology templates may be larger than in the previous version. Such a topology template is encoded into our path-based model in a straightforward fashion. Each path in the topology template fixes the direction of the prefixes of a certain number of paths; the number of these paths corresponds to the #k node at the end of the template path. Any path terminating in an actual leaf corresponds to exactly one path in the path-based model.
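The cardinality-annotated templates of Algorithm 2 admit a compact recursive enumeration. The sketch below is our own rendering: templates are nested tuples, 'L' stands for an actual leaf, and ('#', k) for an incomplete subtree on k nodes; the base cases are simplified relative to the paper's algorithm.

```python
# Sketch of topology-template enumeration in the spirit of Algorithm 2.
# 'L' is an actual leaf, ('#', k) an incomplete subtree on k nodes, and
# an internal node is a pair (left_template, right_template).

def templates(n, d):
    """Yield templates of binary-tree topologies on n nodes (n odd),
    with the top of the tree expanded down to the cut-off depth d."""
    if n == 1:
        yield 'L'                    # a single node must be a leaf
    elif d == 0:
        yield ('#', n)               # below the cut-off: left to the solver
    else:
        for left in range(1, n - 1, 2):      # subtree sizes are always odd
            right = n - 1 - left
            for lt in templates(left, d - 1):
                for rt in templates(right, d - 1):
                    yield (lt, rt)
```

For example, templates(7, 1) yields the three templates ('L', ('#', 5)), (('#', 3), ('#', 3)), and (('#', 5), 'L'), while a sufficiently large cut-off makes the enumeration degenerate to all complete topologies (for n = 7 there are 5, the Catalan number C_3).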
A cube describing a topology template can either be encoded into assumptions, to enable incremental SAT solving [7], or appended as a set of unit clauses. We observed that in our case incremental solving does not pay off for hard instances. However, at the same time, if a large number of topology templates needs to be examined, initializing a new SAT solver for each one of them is too costly. Therefore, the implementation employs both modes, incremental and non-incremental, depending on the number of topology templates to be examined.

Another point of interest is the order in which the topology templates are examined. In the case of non-incremental SAT solving and unsatisfiable instances, the order does not matter because all formulas need to be solved independently of one another. Hence, the order mainly plays a role for satisfiable instances. The ordering heuristic we propose is the following. We start with the assumption that we already have a suboptimal solution to the problem from a greedy (fast) algorithm. We would like to first focus on topologies that are similar to the topology of this suboptimal solution. In order to do so, we need some notion of difference between topologies (and topology templates). For this purpose we define a simple function that recursively compares the two topologies and accumulates a penalty wherever they differ. Additionally, subtrees deeper in the tree are weighted less. Algorithm 3 shows the function (a concrete sketch is also given at the end of this section). If one of the given trees is empty, the penalty is the size of the other tree weighted by the factor w. Otherwise, the penalty is calculated as the sum of the penalties of the left and the right subtrees, respectively. As the recursion descends, the weight is gradually decayed by the factor Δ ∈ (0, 1]. In the implementation we chose the ad-hoc value Δ = 0.75. When partitioning the search space, the topology templates are enumerated in increasing order of difference from the suboptimal solution.

The tool was implemented on top of the PySAT package [18], which interfaces with a number of modern SAT solvers and provides several implementations of cardinality encodings. We used the CaDiCaL solver [3] and the k-Cardinality Modulo Totalizer [26]. This configuration was chosen after careful preliminary experiments. We show that this configuration performs significantly better than the configuration used in the evaluation of Narodytska et al. Our preliminary experiments also informed other ad-hoc choices that had to be made, as the search and encodings can be configured in a large number of ways. An alternative would be to employ automated parameter tuning in the spirit of ParamILS [15]; we leave this as future work.

The SAT solver is used in a non-incremental fashion, i.e., every decision problem is solved independently of the other ones. The exception is topology enumeration: if the number of topologies is larger than 500, the incremental mode is employed (see Sect. 4). A suboptimal greedy solution is obtained by the popular machine learning library Scikit-learn [30], which also enables a seamless integration with the Python implementation. The greedy solution is used in two scenarios: 1) to obtain an upper bound on the number of nodes in the solution, and 2) to inform the ordering of topologies during enumeration (see Sect. 4).

The experiments were performed on servers with Intel(R) Xeon(R) CPUs at 2.60 GHz, 24 cores, and 64 GB RAM, always running 4 tasks in parallel. The time limit was set to 1000 s and the memory limit to 3 GB.
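For concreteness, the following is a sketch of Algorithm 3's difference measure referenced above. The nested-pair representation of topologies (a leaf is (None, None)) and the size helper are our own assumptions; the decay factor 0.75 is the paper's ad-hoc choice.

```python
# Sketch of Algorithm 3: weighted difference between two topologies.
# A topology is a pair (left, right); a leaf is (None, None); None
# denotes an empty (missing) subtree.
DECAY = 0.75  # the paper's ad-hoc Delta in (0, 1]

def size(t):
    return 0 if t is None else 1 + size(t[0]) + size(t[1])

def penalty(t1, t2, w=1.0):
    """Differences deeper in the tree carry less weight (decay of w)."""
    if t1 is None and t2 is None:
        return 0.0
    if t1 is None or t2 is None:
        # one side is missing: penalize the size of the other side
        return w * size(t1 if t2 is None else t2)
    return (penalty(t1[0], t2[0], w * DECAY) +
            penalty(t1[1], t2[1], w * DECAY))
```

Topology templates would then be sorted by penalty(template, greedy_topology) in increasing order before being handed to the SAT solver.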
The experimental results report on the following search modes: (1) binary search on the number of nodes, with no restriction on the depth and without topology enumeration (with the sklearn upper bound); (2) linearly increasing the number of nodes, with no restriction on the depth, with topology enumeration (linear UNSAT-SAT search); (3) linearly increasing the depth and, for each considered depth, linearly increasing the number of nodes. Searches (1) and (2) find the smallest tree, just as in [27]. Search (3) finds the smallest tree in the lexicographic ordering of the pair depth-size.

The evaluation was carried out on the benchmarks used in [27], kindly provided by Narodytska. These benchmarks were originally obtained by sampling a large set of instances [29], with sampling percentages of 20% and 50% (we have used the exact same sampled benchmarks as Narodytska et al.). The reader is referred to [27] for the details of the sampling procedure.

We compare our tool with the state-of-the-art tool mindt [27]. Our tool is run in the following configurations. The configuration dtfinder corresponds to search (1), i.e. size minimization via binary search and the path-based encoding. The configuration dtfinder-DT1 is the same type of search but with the encoding of Narodytska et al. The suffix -T in a configuration indicates topology-based search (search (2)); the suffix -T-O indicates topology-based search with heuristic ordering of the topologies. The configuration d-dtfinder corresponds to search (3), i.e. depth-size minimization.

Table 3 summarizes the results for all the considered benchmarks and tools. The first row (%) shows the percentage of random samplings used to construct the instance, the second row the average number of features (nf.) and samples (ns.). The third row (#I) shows the number of benchmarks in that category. The remaining rows are grouped according to the tool they represent. For each of the tools we present four values: the average number of nodes discovered (#nd); the average depth of the tree reported (depth); the average CPU time taken on solved instances (cpu-time); and the number of instances solved (#slv). Figure 3 shows a histogram of the sizes of the optimal trees per solver. The vertical axis shows the number of solved instances and the horizontal axis groups the solvers according to the number of nodes of the reported decision trees. A more detailed overview of the data can be found on the authors' website [20].

Table 3 supports the following conclusions. Our implementation (dtfinder) outperforms the tool of Narodytska et al. (mindt) in all cases. This is also the case with their own encoding; we attribute this to the choice of the cardinality encoding and the SAT solver. We used the k-Cardinality Modulo Totalizer and CaDiCaL, while mindt uses the sequential counter and glucose 3.0. Comparing dtfinder with d-dtfinder, we can see that d-dtfinder is faster to compute a minimum-depth solution than dtfinder is to compute a minimum-size solution and, even more interestingly, again solves more instances. The topology search-space splitting is beneficial in all encodings except for depth minimization: both our encoding and the encoding of Narodytska et al. solve more instances with topology enumeration. This does not always help the average CPU time; however, for the path-based encoding on the 50% instances it went from 60 s to 39 s. The ordering of topologies enables a minor speed-up, but overall the effect is small. The distribution of sizes of solved instances (Fig. 3) shows that the hardness of an instance grows drastically with the size.
While depth minimization is able to solve a handful of instances of size 29, the path-based encoding solves just one, our implementation of the encoding of Narodytska et al. none, and, surprisingly, mindt one. This can be attributed to the number of topologies (see Table 2).

Overall, focusing on minimizing depth first is computationally advantageous while still yielding decision trees of good quality. We illustrate this on a particular instance. Figure 4 shows the decision trees calculated by our approach minimizing depth first (d-dtfinder) and by the greedy approach (sklearn). The optimal tree has depth 6 and size 29; the greedy approach gives depth 11 and size 37. In contrast, the other approaches time out on this instance within the 1000 s limit.

Greedy algorithms for learning decision trees based on recursive splitting are well known [6, 32, 33]; see also [8] for an overview. Various notions of optimality of decision trees appear in the literature. Some approaches focus on finding a tree with a fixed depth but with the best accuracy [1, 37, 38]. These approaches assume a full (perfectly balanced) binary tree of the fixed depth whose accuracy is to be optimized. While the problem is still very hard, it is in some sense easier because the topology is fixed and only the labeling needs to be calculated. Nevertheless, combining these approaches with ours is an interesting line of research. Another approach is taken by [14], which optimizes a linear combination of accuracy and size. However, this approach is based on brute-force search, and in our experiments we were only able to synthesize trees with a handful of features, while the considered benchmarks contain hundreds of features.

Closest to our work is [27], which uses a SAT encoding to construct a size-optimal decision tree for a given set of consistent samples. In contrast to our work, individual nodes and their child relation are modeled explicitly. This means that a path from the root to a leaf is implicit. In principle, one could also restrict the depth of these implicit paths by adding additional counters or some other form of cardinality constraints, but this is bound to be less efficient. Further, our encoding is closer to the idea of a tree. If the tree is modeled through nodes, it must be ensured via cardinality constraints that the result is in fact a tree: each node except the root has one and only one parent, and each internal node has two children. These cardinality constraints are not needed in our encoding. Since in our case classes are per path rather than per node, we save half of the semantic constraints (see Sect. 3.1). It is interesting to compare how symmetries are broken in [27], where restrictions are imposed on the possible child nodes. In our approach, paths are ordered lexicographically rather than in an arbitrary order. This order lets us single out the leftmost and the rightmost branches, which turned out to be useful in lower-bounding the depth (Sect. 3.2). We remark that lexicographic order is a popular means of breaking symmetries in general graphs, cf. [12].

Earlier work on the minimization of decision trees using Constraint Programming (CP) exists [2]. It was shown in [27] that the approach of Narodytska et al. strictly outperforms the approach of Bessiere et al.; this is most likely attributable to the fact that the CP encoding is asymptotically much larger.

Synthesis by calls to a SAT/SMT solver has seen increased interest in recent years, cf. [19, 21, 28]. Haaswijk et al.
used topology enumeration to synthesize Boolean circuits [9]. The general idea is analogous to our approach (see Sect. 4). However, the set of possible topologies is partitioned differently. The possible topologies are DAGs, whereas they are trees in our case. Topologies in their approach belong to the same partition if they have the same number of nodes at each level (levels are obtained by BFS). This approach is unlikely to give a good partitioning for binary trees and is more expensive to encode than ours. Further, in our approach, the enumeration of topologies simply goes over all possible topologies if the number of nodes is small.

The well-known technique of cube-and-conquer (CnC) splits the search-space by a lookahead solver [11, 17]. The lookahead solver is run with a bound, which yields cubes to be decided by a traditional CDCL solver. Compared to our approach, CnC is much more general since it is applicable to any SAT instance, and the lookahead solver is less likely to generate cubes that will be decided trivially. The downside is that CnC may not come up with a splitting a human would. Further, the lookahead solver can be very costly. In our preliminary experiments, CnC performed much more poorly than a plain SAT solver on our instances. The order in which cubes are decided is also investigated by Heule et al. [11].

This paper proposes a novel SAT-based encoding for decision trees, which enables natively controlling both the tree's size and depth. We also study search-space splitting by topology enumeration. Our implementation outperforms the existing work of [27] and also enables finer control due to the explicit representation of the paths of the tree. This finer control lets us optimize practically interesting instances that had been out of reach.

The proposed approaches open a number of avenues for future research. The solving itself could be further improved by better splitting, parallelization, and combining with cube-and-conquer [11]. While some preprocessing of the examples was already used in our optimization techniques (Sect. 3.2), further inspection could be used to draw more information from them, e.g., the introduction of extended variables in the spirit of [23]. The proposed techniques could also be integrated into more expressive approaches, e.g. SMT-based synthesis [21]. At the application level, we are investigating the integration of our tool with some greedy approaches, e.g. ensembles, where only limited depth is considered. Another option is a hybrid between a greedy approach and an exact approach, where the exact approach is invoked on smaller sub-problems. It would be interesting to investigate whether trees with a smaller depth are really easier to understand and interpret, and what the trade-off between depth and size is. Our approach provides the means to exactly quantify these metrics.

The experimental evaluation shows that SAT solvers handle a search-space with many topologies poorly. We believe that this represents an important challenge for the SAT community.

Acknowledgements. This work was supported by national funds through FCT, Fundação para a Ciência e a Tecnologia, under project UIDB/50021/2020 and the project INFOCOS with reference PTDC/CCI-COM/32378/2017. The results were supported by the Ministry of Education, Youth and Sports within the dedicated program ERC CZ under the project POSTMAN with reference LL1902.
References

[1] Optimal classification trees
[2] Minimising decision tree size as combinatorial optimisation
[3] CaDiCaL, Lingeling, PLingeling, Treengeling and YalSAT entering the SAT competition 2017
[4] Handbook of Satisfiability
[5] Random forests
[6] Classification and Regression Trees
[7] Temporal induction by incremental SAT solving
[8] Decision tree
[9] SAT based exact synthesis using DAG topology families
[10] Lower bounds on learning decision lists and trees
[11] Cube and conquer: guiding CDCL SAT solvers by lookaheads
[12] Optimal symmetry breaking for graph problems
[13] Solving and verifying the Boolean Pythagorean triples problem via cube-and-conquer
[14] Optimal sparse decision trees
[15] ParamILS: an automatic algorithm configuration framework
[16] Constructing optimal binary decision trees is NP-complete
[17] Partitioning SAT instances for distributed solving
[18] PySAT: a Python toolkit for prototyping with SAT oracles
[19] A SAT-based approach to learn explainable decision sets
[20] Authors' website (detailed experimental data)
[21] Learning SMT(LRA) constraints using SMT solvers
[22] Computer-aided proof of Erdős discrepancy properties
[23] Factoring out assumptions to speed up MUS extraction
[24] Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI 2018)
[25] Prime implicant computation using satisfiability algorithms
[26] MSCG: robust core-guided MaxSAT solving
[27] Learning optimal decision trees with SAT
[28] Assessing heuristic machine learning explanations with model counting
[29] PMLB: a large benchmark suite for machine learning evaluation and comparison
[30] Scikit-learn: machine learning in Python
[31] A structure-preserving clause form translation
[32] Induction of decision trees
[33] C4.5: Programs for Machine Learning
[34] Pseudo-Boolean and cardinality constraints
[35] Artificial Intelligence: A Modern Approach
[36] Conflict-driven clause learning SAT solvers
[37] Learning optimal decision trees using constraint programming
[38] Learning optimal classification trees using a binary linear program formulation