Make E Smart Again (Short Paper)

Zarathustra Amadeus Goertzel

Automated Reasoning (2020-06-06). DOI: 10.1007/978-3-030-51054-1_26

Abstract. In this work in progress, we demonstrate a new use-case for the ENIGMA system. ENIGMA, using the XGBoost implementation of gradient boosted decision trees, has demonstrated a high capability to learn to guide the E theorem prover's inferences in real time. Here, we strip E to the bare bones: we replace the KBO term ordering with an identity relation as the minimal possible ordering, disable literal selection, and replace evolved strategies with a simple combination of the clause weight and FIFO (first in, first out) clause evaluation functions. We experimentally demonstrate that ENIGMA can learn to guide E as well as the smart, evolved strategies, even without these standard automated theorem prover functionalities. To this end, we experiment with XGBoost's meta-parameters over a dozen loops.

State-of-the-art saturation-based automated theorem provers (ATPs) for first-order logic (FOL), such as E and Vampire [12], employ the given clause algorithm [13], translating the input FOL problem T ∪ {¬C} (background theory and negated conjecture) into a refutationally equivalent set of clauses. The search for a contradiction is performed by maintaining sets of processed (P) and unprocessed (U) clauses (the proof state Π). The algorithm repeatedly selects a given clause g from U, moves g to P, and extends U with all clauses inferred from g and P. This process continues until a contradiction is found, U becomes empty, or a resource limit is reached.

Historically, term ordering, together with literal selection, has been used to guarantee the completeness of the proof search [1] and to "tame the growth of the search space and help steer proof search" [5]. Term ordering ensures that rewriting happens in only one direction, toward smaller terms. Literal selection limits the inferences done with each given clause g to the selected literals, which slows down the growth of the search space and reduces redundant inferences.

E includes a strategy language of clause evaluation functions, made up of weight and priority functions, to heuristically guide the proof search. In this work, I use two algorithmically invented [6, 7] strategies, E1 and E2 (both displayed in the appendix), that use many sophisticated clause evaluation functions, the Knuth-Bendix ordering (KBO6), literal selection, and other E heuristics.

The ENIGMA [4, 8-10] system with the XGBoost [2] implementation of gradient boosted decision trees has recently demonstrated a high capability to learn to guide the E [14] theorem prover's inferences in real time. ENIGMA uses the XGBoost model as a clause evaluation function to recommend clauses for selection based on clause and conjecture features. In particular, after several proving and learning iterations, its performance on the 57880 problems from the Mizar40 [11] benchmark improved by 70% (= 25397/14933) [10] over the strategy E1 used for the initial proving phase.

In this work, E is stripped to the bare bones by disabling term ordering and literal selection. KBO6 is replaced with an identity relation as the minimal possible ordering, called IDEN (an addition to E; the modified E version is available at https://github.com/zariuq/eprover/tree/identity-order, and the library for running ENIGMA with E at https://github.com/zariuq/enigmatic). While this frees E to perform inferences in any order, E can no longer perform rewriting inferences.
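To fix intuition for what the learned guidance must control, here is a minimal sketch of the given clause loop described above, written in Python purely for illustration; it is not E's actual implementation. The callables select, infer, and is_contradiction are hypothetical placeholders for E's clause evaluation heuristics (the choice point that strategies and ENIGMA implement), its inference rules, and its empty-clause check.

    # Minimal illustrative sketch of the given clause algorithm (not E's code).
    # `select`, `infer`, and `is_contradiction` are hypothetical placeholders.
    def given_clause_loop(initial_clauses, select, infer, is_contradiction, step_limit):
        processed = []                       # P: processed clauses
        unprocessed = list(initial_clauses)  # U: unprocessed clauses
        for _ in range(step_limit):          # stop at a resource limit
            if not unprocessed:
                return "saturated"           # U is empty: no refutation found
            g = select(unprocessed)          # heuristically pick the given clause g
            unprocessed.remove(g)
            processed.append(g)              # move g to P
            for c in infer(g, processed):    # all inferences between g and P
                if is_contradiction(c):
                    return "proof found"     # the empty clause was derived
                unprocessed.append(c)        # extend U with the new clauses
        return "resource limit reached"

Everything that follows concerns how well the select step can be learned when the rest of the standard machinery is stripped away.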
The strategy E1 is replaced with a simple combination of the clause weight and FIFO (first in, first out) clause evaluation functions. E is thus practically reduced to a basic superposition prover, without advanced heuristics, rewriting, or completeness guarantees. We call this strategy E0. E0 solves only 3872 of the Mizar40 problems in 5 s, compared to 14526 for E1.

The first research question is the extent to which ENIGMA with this basic prover can learn ATP guidance completely on its own. The second is to what extent ENIGMA's learning can be boosted with data from strong strategies and models. That is, I explore how smart machine learning can become in this zero-strategy setting. The more general related question is to what extent machine learning can replace the sophisticated human-invented body of theorem-proving wisdom used in today's ATPs for restricting advanced proof calculi.

We evaluate ENIGMA with the basic strategy E0 in several scenarios and over two datasets of different sizes. All experiments are run with 5 s per problem (as a rule of thumb, E solves most problems within a few seconds or not for a very long time), on the same hardware unless otherwise specified: Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz with 188 GB RAM.

ENIGMA has so far been used in two ways: coop combines the learned model with some standard E strategy equally (50:50), while solo uses only the learned model for choosing the given clauses. The best results have been achieved with MaLARea-style [16] looping: an ENIGMA model is trained and run with E (loop 0), then the resulting data are added to the initial training data and a new ENIGMA model is trained (loop 1), and so on. In this work, ENIGMA trains on both solo and coop data. I present results from solo runs because they represent the most minimal setting.

The E evaluations and XGBoost training can take a long time on the full Mizar40 dataset, so the meta-parameters are tested on 2000 randomly sampled problems. Each XGBoost model consists of T decision trees of depth D; together with the learning rate (η = 0.2), these are the most important training meta-parameters. In previous work with ENIGMA, T and D were fixed for all loops of learning. Here we vary the values of T and D during 16 loops. Let S_{D,T} denote the experiment with specific values of D and T. Of the many protocols tested, the following are included in the plot of solved problems: Fives (S_{5,100}), Nines (S_{9,100}), Thirteens (S_{13,200}), and Sixteens (S_{16,100}).

We also experiment with adaptively setting the meta-parameters as the number of training examples increases, according to the following protocols (a schematic implementation of Inc is sketched after the results below):

- Inc (S_{[3,33],100}) increases D by 2 from 3 to 33 and keeps T = 100 fixed.
- 32_inc (S_{32,[50,250]}) fixes D = 32 and gradually increases T from 50 to 250.
- Inc2 (S_{[3,33],*}) gradually decreases T from 150 to 50, varying the value intuitively (see the appendix).
- Inc3 (S_{[3,33],[50,250]}) aims to be more systematic and steps T up from 50 to 250.
- Dec3 (S_{[3,33],[250,50]}) decreases T from 250 to 50.

At the 16th loop, Inc's performance is the best, solving 299 problems and doubling the performance of E0 (152). However, Inc2 and Inc3 solve 298 problems each, and 32_inc solves 291 problems.
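The following sketch shows one plausible reading of the Inc protocol in XGBoost's Python API; it is illustrative only. The feature matrix X and label vector y stand for ENIGMA's clause feature vectors and their positive/negative labels, whose construction is not shown here.

    import xgboost as xgb

    def train_inc_model(X, y, loop):
        """One training loop under Inc (S_{[3,33],100}): D grows, T stays 100."""
        depth = min(3 + 2 * loop, 33)        # D increases by 2 per loop, 3 -> 33
        params = {
            "max_depth": depth,
            "eta": 0.2,                      # learning rate used in the paper
            "objective": "binary:logistic",  # positive vs. negative clauses
        }
        dtrain = xgb.DMatrix(X, label=y)
        return xgb.train(params, dtrain, num_boost_round=100)  # T = 100 trees

The other protocols differ only in the schedule: 32_inc would fix depth = 32 and step num_boost_round from 50 up to 250, while Dec3 would step it downward from 250 to 50.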
The conclusion is that simple protocols work well so long as T or D is incremented adaptively rather than fixed.

These experiments are done on the large benchmark of 57880 Mizar40 [11] problems from the MPTP dataset [15]. E1 and E2 are two strong E strategies that solve 14526 and 12788 problems, respectively.

- Experiment 1 is done with D = 9 and T = 200 and uses our previously trained model that allowed us to solve 25562 problems when cooperating with E1 in our previous experiments [10].

The feature vector size is 2^14 and is decreased, down to 32 (= 2^5), to allow the data to fit in RAM.

As seen in the figure, the strong model does not help much in guiding E without ordering or selection in Exp. 1. Exp. 2 learns gradually and catches up with Exp. 1, but seems to plateau around 10,000 problems. Surprisingly, the pure Exp. 3 learns fast with the small feature size, but then plateaus and drops in performance, perhaps due to overfitting. Exp. 4 indicates that boosting with guidance data is useful: it surpasses E2, solving 13805 problems in round 13. Exp. 5 solves 15990 problems, showing that ENIGMA can take E0 beyond the smart strategies given appropriate parameters and boosting. This is a great improvement over the 3872 problems solved by E0.

ENIGMA can learn to guide the E prover effectively even without smart strategies and term orderings. The models confer a 256% increase over the naive E0 after 13 rounds of the proving/learning loop, and a 121% increase even when trained without guidance data. The experiments indicate that machine learning can be used to fully control an ATP's guidance, learning to replace term orderings and heuristic strategies, and to deal with the increase in generated clauses without literal selection. However, the combination of ENIGMA and standard ATP heuristics still significantly outperforms ENIGMA alone.

Given the large symmetry-breaking impact of these methods in classical ATP, future work includes, e.g., training the guidance in such a way that redundant (symmetric) inferences are not done by the trained model once it has committed to a certain path. This probably means equipping the learning with more history and knowledge of the proof state in the saturation-style setting. ENIGMAWatch [4] may aid with symmetry breaking by focusing the proof search on particular proof paths.

Additional work is needed to isolate the factors in Exp. 5's performance and to determine the most effective boosting methods beyond increasing D and T with the training loops. Ablation studies should be done to discover the individual impact of term ordering and literal selection on the performance of E and ENIGMA. Perhaps term ordering alone is sufficient to train good ENIGMA models. Running ENIGMA without term ordering and other restrictions is important because it may allow us to combine training data from different strategies, and it may allow ENIGMA to find novel proofs.
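For reference, the proving/learning loop with boosting used in the experiments above can be summarized schematically. This is a simplified sketch, not the actual ENIGMA tooling: run_prover, extract_examples, train_model, and schedule are hypothetical stand-ins for running E with an ENIGMA model on the benchmark, harvesting clause examples from the runs, fitting an XGBoost model, and choosing D and T per loop; boost_data stands for examples harvested from strong strategies such as E1 and E2.

    # Schematic sketch of MaLARea-style looping with boosting (illustration only).
    def boosted_looping(problems, boost_data, loops, schedule,
                        run_prover, extract_examples, train_model):
        examples = list(boost_data)              # seed with strong-strategy data
        D, T = schedule(0)
        model = train_model(examples, D, T)      # loop-0 model
        for loop in range(1, loops + 1):
            results = run_prover(problems, model)   # proving phase
            examples += extract_examples(results)   # clauses used in proofs are
                                                    # positive, the rest negative
            D, T = schedule(loop)                # e.g. Inc: min(3 + 2*loop, 33), 100
            model = train_model(examples, D, T)     # learning phase
        return model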
Strategy E1 is:

    --definitional-cnf=24 --split-aggressive --simul-paramod
    --forward-context-sr --destructive-er-aggressive --destructive-er
    --prefer-initial-clauses -tKBO -winvfreqrank -c1 -Ginvfreq -F1
    --delete-bad-limit=150000000 -WSelectMaxLComplexAvoidPosPred
    -H'(1*ConjectureTermPrefixWeight(DeferSOS,1,3,0.1,5,0,0.1,1,4

Strategy E2 is:

    --definitional-cnf=24 --split-aggressive --split-reuse-defs
    --simul-paramod --forward-context-sr --destructive-er-aggressive
    --destructive-er --prefer-initial-clauses -tKBO -winvfreqrank -c1
    -Ginvfreq -F1 --delete-bad-limit=150000000
    -WSelectMaxLComplexAvoidPosPred
    -H'(3*ConjectureRelativeSymbolWeight(PreferUnitGroundGoals,0.1,100,100,50,100,0.3,1.5,1.5),
        4*FIFOWeight(PreferNonGoals),
        5*RelevanceLevelWeight2(ConstPrio,1,0,2,1,50,-2,-2,100,0.2,3,4))'

In this section I include the details of the intuitively toggled protocols. Protocol Inc2 is as follows:

The protocol for Exp. 2 is as follows:

The protocol for Exp. 5 requires some explanation. The motivation is to see how far E0 can be taken, even if the methods are too CPU-intensive for a thorough grid search. Exp. 2 and Exp. 4 demonstrate the utility of boosting. Thus, to create better boosting data, I trained ENIGMA for 10 loops with strategies E1 through E12 and used the results as boosting data for the first 4 of 10 loops of training. In addition to training E0, and in the spirit of ablation studies, I also trained ENIGMA models for E0 with KBO ordering (and no literal selection) and for E0 with KBO ordering and restricted literal comparisons. The motivation is that these versions may serve as a bridge between standard E and the basic E0. Then I used these results to boost an ENIGMA model in loop 0 and trained based on this for 10 loops, proving 9759 problems. Finally, this data, together with the data from a loop-3 ENIGMA model trained with E1, is used to boost E0 with the following meta-parameters:

References

[1] Rewrite-based equational theorem proving with selection and simplification
[2] XGBoost: a scalable tree boosting system
[3] ENIGMA-NG: efficient neural and gradient-boosted inference guidance for E
[4] ENIGMAWatch: ProofWatch meets ENIGMA
[5] Selecting the selection
[6] Hierarchical invention of theorem proving strategies
[7] Extending E prover with similarity based clause selection strategies
[8] ENIGMA: efficient learning-based inference guiding machine
[9] Enhancing ENIGMA given clause guidance
[10] Hammering Mizar by learning clause guidance
[11] MizAR 40 for Mizar 40
[12] First-order theorem proving and Vampire
[13] A new class of automated theorem-proving algorithms
[14] Faster, higher, stronger: E 2.3
[15] MPTP 0.2: design, implementation, and initial experiments
[16] MaLARea SG1 - machine learner for automated reasoning with semantic guidance

Acknowledgments. The research topic was proposed by Jan Jakubův and Josef Urban, and further discussed with them, Martin Suda, and Thomas Tan. I also thank the AITP'20 anonymous referees for their comments on the first extended abstract of this work.