Title: Conjecturing-Based Computational Discovery of Patterns in Data
Authors: Brooks, J. P.; Edwards, D. J.; Larson, C. E.; Van Cleemput, N.
Date: 2020-11-23

Modern machine learning methods are designed to exploit complex patterns in data regardless of their form, while not necessarily revealing them to the investigator. Here we demonstrate situations where modern machine learning methods are ill-equipped to reveal feature interaction effects and other nonlinear relationships. We propose the use of a conjecturing machine that generates feature relationships in the form of bounds for numerical features and boolean expressions for nominal features that are ignored by machine learning algorithms. The proposed framework is demonstrated for a classification problem with an interaction effect and a nonlinear regression problem. In both settings, true underlying relationships are revealed and generalization performance improves. The framework is then applied to patient-level data regarding COVID-19 outcomes to suggest possible risk factors.

Modern machine learning methods allow one to leverage complex relationships present in data to generate accurate predictions but do not reveal them to the investigator. We propose an automated conjecturing framework for discovering nonlinear and boolean relationships among the features in a given dataset that can be used to enhance the prediction of a response while providing model transparency. In many situations, such as medical and financial decision making, knowing the basis for predictions is important for reasons including understanding causal relationships, establishing trust, and assigning liability (Stoyanovich et al. 2020). Udrescu and Tegmark (2020) proposed a system called AI Feynman that combines deep learning with methods for symbolic regression to recover nonlinear relationships in data. Impressively, they recover over 100 equations of varying complexity from data. In contrast to AI Feynman, our proposed framework uses Fajtlowicz's Dalmatian heuristic (Fajtlowicz 1995) to discover bounds rather than equations. Further, our framework can also be applied to categorical data to discover sufficient conditions for class membership for classification problems. This work represents the first application of the Dalmatian heuristic to learning from data.

To illustrate the potential value of bounds and sufficient conditions, we describe two problems and relevant results from graph theory. This paper extends these ideas regarding bounds and sufficient conditions to learning from data. A graph is a collection of vertices V and edges E that are ordered pairs of vertices. Consider the problem of finding bounds for the independence number of a graph.[1] It is well known that the linear programming (LP) relaxation of an appropriate integer program provides an upper bound on the independence number (Schrijver 2003). The Lovász ϑ number of a graph also provides an upper bound that is known to be no larger than the LP relaxation bound for any graph (Lovász 1979). Therefore, the Lovász ϑ bound dominates the LP relaxation bound, and such relationships are commonly pursued. However, relationships among bounds can be more nuanced. Consider a third bound on the independence number due to Haemers (1979).
For some graphs it is a stronger bound than Lovász ϑ while on other graphs it is a weaker bound; for some graphs, Lovász ϑ is a sharp bound and Haemers's bound is not, while for other graphs, Haemers's bound is a sharp bound and Lovász ϑ is not. It remains an open question whether there is a "small" number of bounds whose largest value for any graph would provide a sharp bound on the independence number. In this paper, we describe a computational approach to discover bounds among continuous features (invariants) in a dataset. As with the independence number, collections of bounds can provide valuable insight into relationships for the system from which the data was collected.

Consider the problem of finding sufficient conditions for a graph to be Hamiltonian.[2] Chvátal (1972) proved that for a graph G with certain conditions on the vertex degrees, G is Hamiltonian. Also, Chvátal and Erdös (1972) proved that if a graph satisfies a connectivity condition, then it is Hamiltonian. These are two conditions that are sufficient for a graph to be Hamiltonian, but neither implies the other. Some graphs satisfy both conditions, some graphs satisfy one condition, and some graphs satisfy neither condition. The existence and discovery of a (small) set of sufficient conditions that characterize all Hamiltonian graphs remains an open area of research. The pursuit of sufficient conditions for graph properties such as Hamiltonicity mirrors that of invariant bounds (Larson and Van Cleemput 2017). In the context of learning from data, we show how categorical data, together with invariant bounds discovered among continuous features, can be used as input to a computational approach for generating sufficient conditions for class membership for classification problems. This method provides an alternative to tree-based methods for generating interpretable logical expressions for classification.

[1] The independence number is the largest number of vertices in a graph no two of which are contained in an edge. The definition of independence number is not important for this example, but only the fact that with every graph is associated a number called the "independence number".
[2] A Hamiltonian graph is a graph with a spanning cycle (West 2001). The definition of Hamiltonian is not important for this example, but only the fact that any graph either is or is not Hamiltonian.

To summarize, we develop a framework to discover and use invariant bounds and property conditions for learning from data. The framework can be used for regression and classification problems. The bounds and conditions produce interpretable models while capturing complex relationships. The remainder of this paper proceeds as follows. In Section 2, we provide a literature review. In Section 3, we provide two motivating examples to illustrate the inability of traditional machine learning techniques to reveal nonlinear features to the user. Section 4 introduces our computational conjecturing framework for scientific discovery using data, and includes results for the examples from Section 3. Section 5 reviews the computational requirements of the proposed framework. In Section 6, we compare our conjecturing framework by applying it to datasets used by Udrescu and Tegmark (2020) and apply an implementation of AI Feynman (Udrescu and Tegmark 2020) to the examples from Section 3. In Section 7, we apply Conjecturing to discover patterns in synthetic patient data regarding COVID-19 outcomes.
The proposed framework builds on and fills gaps in computational scientific discovery, interpretable artificial intelligence (AI), automated feature engineering, and statistical methodology. Our proposed framework is based on an implementation of Fajtlowicz's Dalmatian heuristic (Fajtlowicz 1995, Larson and Van Cleemput 2017). The heuristic was originally implemented in Graffiti (Fajtlowicz 1995), which was the first program to produce research conjectures that led to new mathematical theory. The program produces statements that are relations between mathematical invariants, which are numerical attributes of examples. Recent implementations of the Dalmatian heuristic have been applied to the discovery of relationships for graphs (Larson and Van Cleemput 2016) and game strategies (Bradford et al. 2020). The heuristic was adapted to work with properties, which are boolean attributes of examples, by Larson and Van Cleemput (2017). More details on the Dalmatian heuristic are included in the Appendix. We built our framework using a more recent implementation of the Dalmatian heuristic available here: http://nvcleemp.github.io/conjecturing/.

Symbolic regression has been used as a tool for automated scientific discovery. Symbolic regression is the use of genetic programming to approximate a target function on training data and generalize to produce predictions on new data (Nicolau and Agapitos 2021). Until the work of Schmidt and Lipson (2009), the focus was on improving prediction accuracy by approximating an underlying function rather than on discovering true functional relationships among features. Schmidt and Lipson (2009) extend previous work to develop a system for discovering laws for dynamical systems by considering relationships among derivatives. Their work led to the development of the software Eureqa. More recently, Udrescu and Tegmark (2020) combined a variety of strategies including dimensional analysis, symmetry identification, neural network training, and brute-force enumeration into a framework called AI Feynman to recover true physical functional forms from data.

Other frameworks have been proposed to computationally generate conjectures from data and discover scientific laws. Data smashing is introduced by Chattopadhyay and Lipson (2014) as a method for computing dissimilarities from streams of data (e.g., electroencephalogram data) to aid in revealing relationships among observations. Jantzen (2020) proposes an algorithm with the similar purpose of detecting types of dynamical systems called dynamical kinds. Subsequently, these kinds "are then targets for law-like generalization" (Jantzen 2020). While Jantzen's work provides a method for discovering the kinds, it does not suggest how to recover the "laws". It is these relationships that we aim to discover with the proposed conjecturing framework.
Our work is distinguished from these previous works in that 1) we focus on generating bounds for invariants that serve as hypotheses for the investigator rather than recovering true functional forms or generating accurate predictions, 2) our invariant conjecturing algorithm is paired with a property conjecturing algorithm for discovering both nonlinear bounds and boolean relationships as sufficient conditions for classification, 3) our framework is designed for a given static observational dataset rather than for discovering laws for dynamical systems, and 4) rather than a stochastic search over the space of functional forms, our proposed system leverages sophisticated techniques for enumerating expressions of increasing complexity (described by Larson and Van Cleemput (2016) for "noiseless" data involving mathematical objects such as graphs). In our system, the human remains "in the loop" to evaluate the plausibility of suggested bounds and conditions.

Langley (2019) provides a review of past efforts in computational scientific discovery. Several frameworks have their origins in analyzing mass spectroscopy and other electrochemical data. Bacon (Langley et al. 1987) is a general framework for scientific discovery based on suggesting and executing a series of designed experiments. Tallorin et al. (2018) proposed a method called POOL that uses Bayesian optimization and machine learning in an iterative fashion for experiments to discover peptide substrates for enzymes. Bacon and POOL both make recommendations regarding additional data to collect, while our system assumes that a fixed dataset is provided that may or may not be the result of a designed experiment.

Precise definitions of "explainability" and "interpretability" are still being developed (Vilone and Longo 2020, Lu et al. 2019, Fürnkranz et al. 2020) as research in the area has rapidly accelerated. According to the convention of Rudin (2019), explainability is concerned with post-hoc analyses of black box models to create simple explanations of model behavior. Motivated by observed accuracies of deep learning models, work in this area includes identifying important features for prediction, building simple local models, conducting sensitivity analyses, and deriving prototype examples (Samek and Müller 2019, Elton 2020). Tsang et al. (2018a,b, 2020) develop neural network frameworks for identifying sets of features for which there is an interaction, a non-additive relationship among predictive features that influence a response value. These methods provide explainability in that they identify sets of features that interact, but the framework is not designed to reveal the functional form of the nonlinear interaction. Interpretable models such as optimization-based decision trees and rule lists have also been proposed (e.g., Akyüz and Birbil 2021). Different from these works, our proposed framework automates the discovery of nonlinear features. In addition, as with work on decision rules in general, our framework can combine the discrete features in data with the discovered nonlinear features to discover a potentially richer set of boolean relationships when compared to optimization-based trees and decision lists.

Khurana et al. (2018) propose a system that leverages reinforcement learning to search expression trees for predictive features. ExploreKit (Katz et al. 2016) is a framework for automatic feature engineering that combines features using basic arithmetic operations and then uses machine learning to rank and select those with high predictive ability.
The Data Science Machine (Kanter and Veeramachaneni 2015) automatically generates features for entities in relational databases with possible dependencies between tables, followed by singular value decomposition. In none of these works is model transparency evaluated, but rather only model performance. An important distinction of our work from these is that they focused on improving prediction accuracy, sometimes at the expense of understandable features, and not on scientific discovery.

Traditional statistical methods for empirical model building (e.g., regression analysis) tend to focus on first- and second-order polynomial models; interaction terms up to a certain degree are often included. Empirical models are intended to provide adequate prediction performance while also providing a simple assessment of feature importance via model coefficients. Techniques such as all-subsets, stepwise selection, and regularization methods (e.g., LASSO (Tibshirani 1996)) are commonly used to perform feature selection over model spaces of increasing complexity. However, domain knowledge is typically required for reciprocal or non-polynomial relationships. While nonlinear regression techniques exist (Seber and Wild 2003, Song et al. 2014), searching for such relationships is a departure from common practice in statistics. Our proposed framework provides a search over a much broader class of nonlinear functions.

In this section, we describe two datasets where a "typical" knowledge discovery workflow fails to reveal important relationships among features. Research on machine learning does, of course, lead to conjectured relationships between variables which are in turn used to make predictions of one or more variables in terms of others. A trained neural net, for instance, can be viewed as a black box representing a function which produces an output for every input in its domain. These functions are complex and of a different character than classical scientific laws: in particular, there is little hope of deriving these functions or relationships from simpler existing laws. Our conjecturing framework aims to help fill this gap in current capabilities.

In this example, a continuous target variable is determined by a more complex nonlinear relationship with three continuous predictors. Consider measurements including the masses of two objects m_1 and m_2, their distance r, and the gravitational force between them F. The goal is to recover the dependence of F on m_1, m_2, and r, or F = k m_1 m_2 / r^2, where k is the gravitational constant. Following the demonstration by Langley et al. (1987), we create a fictional dataset using a predefined value for k that is a random number between 0 and 1. For our illustrative example, we generated 1,000 training data points and 1,000 testing data points with k = 0.057098. Values for m_1, m_2, and r are samples from Uniform(1, 100000) distributions, and F is calculated for each sample with no noise. A linear regression model will fail to capture the nonlinear interaction of the variables. Off-the-shelf machine learning methods such as random forests and neural networks can leverage the nonlinear relationship in the data but cannot present the relationship to the investigator. In the next section, we propose a framework for producing bounds on F that are functions of the other features.
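The generating process just described is simple to reproduce. The following sketch, assuming NumPy and that the three predictors are sampled independently, builds comparable training and testing sets; the function name and random seed are ours, not the authors'.

```python
import numpy as np

def make_gravity_data(n, rng, k=0.057098):
    # Masses and distance drawn independently from Uniform(1, 100000), as described above.
    m1 = rng.uniform(1, 100000, n)
    m2 = rng.uniform(1, 100000, n)
    r = rng.uniform(1, 100000, n)
    F = k * m1 * m2 / r**2  # noiseless response
    return m1, m2, r, F

rng = np.random.default_rng(0)
train = make_gravity_data(1000, rng)
test = make_gravity_data(1000, rng)
```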
The second example is a case where a binary target variable is completely determined by the product of two continuous features in the dataset; i.e., the second-order interaction term completely defines the relationship. Consider a dataset on residential real estate properties for sale obtained from https://www.redfin.com. The goal is to predict whether a home with given feature values has a list price above or below $300,000. This dataset includes both the price per square foot and total square footage, along with eight additional features such as the number of bathrooms and bedrooms. The target (above vs. below) can be determined by multiplying the price per square foot by the square footage and setting a threshold. Thus, the interaction of price per square foot and square footage, hereafter called the active interaction, completely describes the relationship between the predictors and the response. Data are partitioned into a training dataset with 1,000 houses and a testing dataset with 30,156 houses. In the next section, we leverage our framework for invariant bounds and then extend it to produce boolean relationships to recover the active interaction term and how it determines class membership.

We now describe a framework that leverages a conjecturing algorithm to discover nonlinear and boolean feature relationships that can enhance understanding and predictive ability. We assume that each retained bound must be tighter than all previously-found bounds for at least one example. This requirement is how the Dalmatian heuristic is applied to continuous data (Fajtlowicz 1995); an analogous requirement, that each retained statement be the best on at least one training example, is how the heuristic is applied to categorical data (Fajtlowicz 1995). See the Appendix for more information on the Dalmatian heuristic. Figure 1 displays (a) upper bounds and (b) lower bounds derived for test instances for the gravity data; the gray curves correspond to bounds.

The invariant version (Conjecturing-INV) of the conjecturing method can be used for regression. For a regression problem, the framework produces nonlinear functions of the original variables that can be used to predict the response. The functions can then be used as features in a regression model. We assume that a dataset (x_i, y_i) ∈ R^m × R is given, where the y_i are values for the response variable. Invariant conjectures are generated that provide upper and lower bounds on the response. These conjectures are the nonlinear functions that can be used as new features and/or as a complete model for the system. The framework is presented in Algorithm 1. For the gravity case from Section 3.1, the invariants are F, m_1, m_2, and r (Step 2). The examples are the observations in the data (Step 3). The conjecturing framework is not designed to recover constants such as the gravitational constant k. In general, for a functional relationship with a constant k such that 0 < k < 1, the expression without the constant provides an upper bound for the response and the reciprocal expression provides a lower bound; in cases where the constant is not between 0 and 1, the converse is true. For this example, F = k m_1 m_2 / r^2 with 0 < k < 1, and so r^2/(m_1 m_2) ≤ F ≤ m_1 m_2/r^2. The conjecturing framework can therefore potentially recover the bounds F ≤ m_1 m_2/r^2 or F ≥ r^2/(m_1 m_2). Among the bounds generated for the gravity data is the upper bound F ≤ m_1 m_2/r^2, which approximates the true gravity relationship used to generate the data. The bound does not include the constant k.
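Although the framework does not produce k itself, the constant can be estimated afterward by treating conjectured bound expressions as features in an ordinary regression. The sketch below is our own post-processing illustration, not part of Algorithm 1; it assumes scikit-learn and regenerates the synthetic gravity data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
k = 0.057098
m1, m2, r = (rng.uniform(1, 100000, 1000) for _ in range(3))
F = k * m1 * m2 / r**2

# Conjectured bound expressions become candidate regression features.
upper = m1 * m2 / r**2       # conjectured upper bound on F
lower = r**2 / (m1 * m2)     # conjectured lower bound on F
X = np.column_stack([upper, lower])

fit = LinearRegression().fit(X, F)
print(fit.coef_, fit.intercept_)  # the coefficient on `upper` is approximately k
```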
The lower bound of r^2/(m_1 m_2) is not recovered. Other bounds are generated by the framework as well; eight of the upper bounds and 15 of the lower bounds for F are depicted in Figure 1. The upper bound m_1 m_2/r^2 in Figure 1(a) is blue, while the true value k m_1 m_2/r^2 is gold. In this example, the proposed conjecturing framework recovers the true nonlinear relationship up to a constant of proportionality, along with 44 additional suggested bounds. Therefore, isolating a single true bound, in the case where the bound is unknown, can require additional analysis and/or experiments. The additional bounds can provide potential insight into feature interactions.

Figure 3. Expression trees for (a) an upper bound on square footage, x1/x2 + x3, where x1 is threeHundredK, x2 is pricePerSquareFoot, and x3 is bathrooms, and (b) the gravitational force k m_1 m_2/r^2.

Algorithm 1 Conjecturing framework for regression models.
Input: Data (x_i, y_i) ∈ R^m × R, i = 1, ..., n.
Output: A set of invariant relations R.

Our proposed framework for classification leverages the invariant version (Conjecturing-INV) and then the property version (Conjecturing-PROP) of the conjecturing algorithm (Algorithm 2). For a classification problem, the framework is designed to produce new binary features that indicate whether nonlinear patterns are satisfied by the original feature values for each observation. These new features can augment the original features in a classification model, capturing nonlinear patterns while maintaining interpretability. We assume that a dataset with class labels (x_i, y_i) ∈ R^m × Y, i = 1, ..., n is given. We also assume that the original features are comprised of continuous features C and binary features B.

We now provide further details on Algorithm 2 using the real estate valuation case from Section 3.2 as an illustrative example. First, we binarize the categorical feature propertyType into binary features condo, mobileHome, singleFamily, townhouse, multiFamily2-4Unit, multifFamily5PlusUnit, and Other. We also add a feature that is a constant value of 300,000 for each observation, because it is the price cutoff that determines the target, and call it threeHundredK. Conjecturing-INV is applied to the houses in each class to generate bounds among the invariants. Also included are less-interpretable bounds such as

hoaPerMonth ≤ 10^(2×bathrooms) + squareFootage.

There were also several bounds discovered that are close approximations of the relationship present in the active interaction term, including

squareFootage ≤ threeHundredK/pricePerSquareFoot + bedrooms. (12)

For houses with target value above, there are 1,457 bounds derived, including a mix of simple relations and less intuitive relations. Also included are three relations that are nearly identical to the active interaction relation.

The resulting invariant relations are pooled together (Step 10). The invariant relations are encoded as properties (Step 12), as are the original binary features from the data (for example, squareFootage ≥ threeHundredK/(pricePerSquareFoot + 1)). These properties can be used as boolean features that indicate whether a nonlinear relationship among continuous features is satisfied for an observation. Conjecturing-PROP returns only two properties, and both approximate the underlying active interaction; one of them is

bathrooms ≥ −threeHundredK/pricePerSquareFoot + squareFootage → isBelow. (25)

An inspection of the data reveals that for some of the houses, there is some rounding error when comparing the price to the square footage multiplied by the price per square foot. The conjecturing algorithm compensates by using invariants as error terms.
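To make the use of such properties concrete, the sketch below applies property (25) as a classification rule to a few hypothetical rows (pandas assumed; the feature values are invented for illustration and are not from the Redfin data).

```python
import pandas as pd

# Hypothetical homes with the column names used in the text.
homes = pd.DataFrame({
    "pricePerSquareFoot": [150.0, 300.0, 95.0],
    "squareFootage": [1800.0, 1200.0, 3500.0],
    "bathrooms": [2.0, 2.0, 3.0],
})
threeHundredK = 300000.0

# Property (25) as a rule:
#   bathrooms >= -threeHundredK / pricePerSquareFoot + squareFootage  ->  isBelow
premise = homes["bathrooms"] >= (
    -threeHundredK / homes["pricePerSquareFoot"] + homes["squareFootage"]
)
homes["predicted_isBelow"] = premise
print(homes)
```

On these rows, the rule labels a home as below $300,000 exactly when its square footage times price per square foot falls under the threshold, with the bathrooms term acting as a small tolerance.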
In the first property, the error term is bathrooms × pricePerSquareFoot. In the second property, the error term is squareFootage + 1. When these properties are applied as classification rules, they produce no error on the training data. The first property misclassifies 37 of 30,156 houses in the test data, for an accuracy of 0.999. The second property misclassifies 26 houses. The misclassified houses are due to rounding error and miscoding of data. For example, one house in the test data is listed as having 31,248 bathrooms and another is listed as having a price of $459. Despite the noise and rounding error in the data, the conjecturing framework was able to recover the active interaction term, and these properties can be used as features for classifiers with near-perfect accuracy.

Algorithm 2 Conjecturing framework for classification models.
Input: Data with class labels (x_i, y_i) ∈ R^m × Y, i = 1, ..., n. The original features {1, ..., m} are comprised of continuous features C and binary features B.
Output: A set of properties P.

Our implementation of the conjecturing algorithm on a computer with an Intel Core i5-4210U 1.7 GHz processor and 4 GB RAM generates over 7.2 billion expressions using 20 invariants/properties in one minute. At one minute, expressions of up to complexity six are generated. Figure 3(a) depicts the expression tree for an expression of complexity five involving features from the real estate data. Figure 3(b) depicts an expression tree for an expression of complexity eight for the formula for gravity, assuming that the gravitational constant k is known. By default, the conjecturing algorithm includes operators such as adding or subtracting 1, dividing or multiplying by 2, and powers of 10. A gravitational constant of 0.057 could be represented as (+1 + 1 + 1 + 1 + 1) × 10^(−1−1) + (+1 + 1 + 1 + 1 + 1) × 10^(−1−1−1), which by itself has complexity 19. In Section 4.1, we used the conjecturing framework to recover the functional form of gravity without k, which is an expression of complexity 7. For all instances of Conjecturing-INV and Conjecturing-PROP described thus far for the gravity and real estate data, we use a time limit of five seconds. In the experiments that follow, we investigate the effects of different time limits.

In this section, we compare the performance of Conjecturing on datasets used by Udrescu and Tegmark (2020) to demonstrate their algorithm AI Feynman, and then apply the implementation of AI Feynman to the gravity and real estate datasets described in Section 3. We apply Conjecturing to the first 10 equations listed in Table 4 of Udrescu and Tegmark (2020) to draw comparisons based on solution time and noise tolerance. We used the data published by the authors here: https://space.mit.edu/home/tegmark/aifeynman.html. As in Udrescu and Tegmark (2020), for each instance we apply Conjecturing with three sub-; for equations that involve the constant π, we include π as a constant invariant. For each instance, we use the first 10 samples in each dataset and run Conjecturing for 5, 100, 1,000, and 10,000 seconds. We also run Conjecturing for the time required by AI Feynman to solve the equations, as reported in Table 4 of Udrescu and Tegmark (2020). Table 1 contains the results of applying Conjecturing to the datasets. Conjecturing produces bounds that match the equation for five of the ten instances.
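One plausible way to check whether a conjectured bound "matches" an equation is to test whether it coincides with the target on every sample up to numerical tolerance. The sketch below illustrates this idea for Feynman equation I.12.1 on invented samples; NumPy is assumed, and the authors' exact matching criterion may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = rng.uniform(1.0, 5.0, 10)      # 10 samples, mirroring the setup above
N_n = rng.uniform(1.0, 5.0, 10)
F = mu * N_n                        # Feynman equation I.12.1

# Candidate expressions built from the invariants.
candidates = {"mu*N_n": mu * N_n, "mu + N_n": mu + N_n}
matches = {name: bool(np.allclose(vals, F)) for name, vals in candidates.items()}
print(matches)  # only mu*N_n coincides with the target on every sample
```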
Conjecturing finds a match for all equations with complexity 10 or less and is unable to find a match for equations with higher complexity. Udrescu and Tegmark (2020) report that AI Feynman resolves all of the equations, while Eureqa (Schmidt and Lipson 2009) only resolves four of the ten equations, some of which are different from those found by Conjecturing. These results indicate that Conjecturing is well suited for recovering equations of complexity 10 or less within 10,000 seconds and sometimes within much shorter times. Higher-complexity formulas require additional time.

Table 1. Conjecturing applied to selected Feynman equations: equation, complexity, and time (in seconds) at which a matching bound was found.

Eq.      Equation                                   Complexity   Time (s)
I.8.14   d = sqrt((x_2 - x_1)^2 + (y_2 - y_1)^2)    10           10^4
I.9.18                                                           Not found
I.10.7   m = m_0 / sqrt(1 - v^2/c^2)                10           14
I.11.19  A = x_1 y_1 + x_2 y_2 + x_3 y_3            11           Not found
I.12.1   F = µ N_n                                  3            5
I.12.2   F = q_1 q_2 / (4π ε r^2)                   12           Not found
I.12.4   E_f = q_1 / (4π ε r^2)                     10           12

We applied the implementation of AI Feynman available here: https://github.com/SJ001/AI-Feynman to the gravity example described in Section 3.1 and the real estate example described in Section 3.2. A difference between the gravity example and the datasets used by Udrescu and Tegmark (2020) is that for the gravity example, the gravitational constant G is the same for every data point, but for the datasets used by Udrescu and Tegmark (2020), G is treated as a variable and is different for each point. Also, note that AI Feynman is designed for recovering continuous functions and is therefore not suitable for classification problems such as the real estate example. For the real estate data, we applied AI Feynman to the data to attempt to discover an equation for price as a function of the other variables. For both instances, the AI Feynman implementation aborts with an error regarding an eigenvalue calculation.

In this section, we demonstrate the proposed framework on synthetic patient-level COVID-19 data that was provided as part of the Veterans Health Administration (VHA) Innovation Ecosystem and precisionFDA COVID-19 Risk Factor Modeling Challenge (https://precision.fda.gov/challenges/11/view). The data include synthetic veteran patient health records including medical encounters, conditions, medications, and procedures. The goal of the challenge was to better understand risk and protective factors for COVID-19 outcomes. Participants were asked to predict alive/deceased status. Since the goal is to better understand risk and protective factors, we evaluate the performance of Conjecturing by checking the validity of the feature relationships on holdout test data rather than on prediction accuracy. Predictions were based on information obtained through December 31, 2019. In the training data, we drop all information pertaining to events on or after January 1, 2020 and drop subjects who died before January 1, 2020. We add a number of features for each patient based on the data, described in the Appendix. In addition, for each reported allergy, device, immunization, procedure, and discretely-measured observation we create an indicator variable to serve as a property. In total, we use 309 invariants and 362 properties. We applied Conjecturing to a training set consisting of 100 subjects from each outcome class (deceased/alive). Upper and lower bounds are generated for each invariant, and for each outcome. These bounds, along with the 362 properties in the data, are used as properties for Conjecturing-PROP. Conjectures are generated for both outcomes.
The parameter "skips" is set to 90%, meaning that an invariant can be used in a conjecture so long as no more than 90% of the examples have a missing value. We use the remaining 73,597 subjects as a test data set, thereby allowing us to ascertain the effects of potential overfitting to the 200 subjects used for training. The framework produced 32 conjectures for sufficient conditions for alive status and 40 conjectures for sufficient conditions for deceased status. The five best-performing conjectures for the test data set for each status in terms of precision are provided in Table 2. Precision is the number of patients who satisfy the premise of the conjecture and have the corresponding status divided by the number of patients who satisfy the premise of the conjecture. If precision is larger than the proportion of the population with the status, then the conjecture provides an increase in prediction ability when compared to selecting an individual at random. Among the 32 conditions for alive status, 17 have higher proportions of alive subjects satisfying the conditions than among the general population. Among the 40 conjectures for deceased status, 33 have higher proportions of deceased subjects than among the general population.

Consider the sufficient conditions for the deceased status in Table 2. Those with the single-factor condition of chronic congestive heart failure are more than four times as likely to die of COVID-19 as someone picked at random from the population (31.5% vs. 7.4%). The second conjecture for a sufficient condition for deceased status in Table 2 implies that lower values of mean CO2 levels, higher values of platelet distribution width, and proteinuria are associated with deceased status. Even though the conjecture is satisfied by only five patients in the dataset, the selected features and the suggested relationship are worthy of further investigation. Each of these features has been independently studied and associated with COVID-19 morbidity and mortality (Hu et al. 2021, Gowda et al. 2021, Werion et al. 2020). In addition to validating a role for these invariants, the conjecture suggests a potential nonlinear relationship among them. The third conjecture for a sufficient condition for deceased status suggests that shorter time under current care plans, higher total lifetime cost of healthcare encounters covered by payers, and living further east are associated with deceased status. The trends for care plan length and cost of healthcare encounters are intuitive, and the functional form is potentially interesting and useful for prognosis. The addition of longitude on the right-hand side could be an indication that residents in the east are at higher risk, or it could be a tolerance term employed by Conjecturing to accommodate noise among the other features. The conjecture is satisfied by 1,564 patients in the dataset and 341 of these have a status of deceased.

For both outcomes, the proposed method is able to generate new sufficient conditions that are true for the respective outcome at higher rates than would be expected for a patient selected at random. These results indicate that the conjecturing process is capturing relationships that hold across the population and are not merely reflective of the 200 training samples. In other words, overtraining appears to be mitigated. The number of conjectures generated is not overwhelming for a human investigator to consider and further investigate. We have demonstrated that automated search for conjectured feature-relations can enhance existing learning algorithms.
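The precision measure used above can be computed from a patient-level table in a few lines. The sketch below is a minimal illustration with pandas; the column names and values are hypothetical.

```python
import pandas as pd

# One row per test patient: does the patient satisfy the conjecture's premise,
# and what is the recorded status?
patients = pd.DataFrame({
    "satisfies_premise": [True, True, False, True, False],
    "status": ["deceased", "alive", "deceased", "deceased", "alive"],
})

def conjecture_precision(df, status):
    satisfied = df[df["satisfies_premise"]]
    return (satisfied["status"] == status).mean() if len(satisfied) else float("nan")

print(conjecture_precision(patients, "deceased"))  # 2 of the 3 premise-satisfying rows
```

Comparing this value with the population rate of the status (7.4% for deceased in the challenge data) reproduces the check described in the text for whether a conjecture improves on selecting a patient at random.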
The discovery of these kinds of feature relationships can also initiate new collaboration with domain scientists and lead to new scientific knowledge. Our proposed Conjecturing framework was able to recover the functional form for gravity with only the measured force, masses, and distance. The framework also recovered a hidden interaction between price per square foot, square footage, and price in real estate data that leads to improved classification performance. Using synthetic patient-level COVID-19 data, the framework produced conjectures that provide insight into the risk of death. The current version of Conjecturing requires that conjectures are true for every example. Consider, by way of comparison, a traditional analysis that seeks to identify a model by selecting the "best" subset from among 10 features and their associated 45 two-way interactions. In this example, simply considering models with only 10 variables requires searching over a model space larger than 2.9 billion models. Future research will investigate the ability of the Conjecturing framework to simplify model spaces and hence provide a mechanism for a more expeditious search of plausible models.

The algorithm we have used to conjecture feature relations is an adaptation of an algorithm that was originally designed to conjecture invariant relations for mathematical objects. In the case of graph theory, for instance, the independence number of a graph, the largest number of vertices of the graph for which no pair of vertices is contained in an edge, is a widely-studied NP-hard-to-compute graph invariant. Bounds for this invariant are of interest for both theoretical and practical reasons: bounds, for instance, can be used to help speed up computation of this number. The program we use was originally designed to compute upper and lower bounds for the independence number of a graph. The program is given the following inputs: an invariant of interest, a bound type (upper or lower), some number of example graphs, and some number of other graph invariants. The program must be able to compute the value of each invariant for each given graph (or at least be provided with their values). Furthermore, the program will produce expressions involving some number of provided unary and binary operators: these are built into the program, and not ordinarily modified by the user. Currently these include unary operations like "+1", squaring, square-rooting, and division by 2, and binary operations including "+", "×", and "−". These can be changed by the user, if desired, in the code. The program then generates expressions of progressively greater complexity. The simplest expressions, of complexity 1, are the invariants themselves. The next simplest are unary operators applied to the given invariants; these have complexity 2. In general, the complexity of an expression beginning with a unary operator is one plus the complexity of the remaining expression, while the complexity of an expression beginning with a binary operator is one plus the complexities of the two sub-expressions. In this way all expressions, of progressively greater complexity, can be generated. These expressions are represented by a tree in the program: complexity is defined in terms of the order of the representing tree. Exhaustive generation of all possible expressions of a given complexity corresponds to a tree generation problem. The program then produces expressions of the form "(invariant of interest) (bound-type) (expression generated from other invariants)".
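The complexity measure just described can be stated in a few lines of code. The sketch below uses a nested-tuple encoding of expression trees; this is our own toy representation, not the conjecturing program's internal one.

```python
# Complexity = number of nodes (the order) of the expression tree.
def complexity(expr):
    if isinstance(expr, str):              # a bare invariant has complexity 1
        return 1
    _op, *args = expr                      # unary: one argument, binary: two
    return 1 + sum(complexity(a) for a in args)

# ("/", ("*", "m1", "m2"), ("*", "r", "r")) encodes m1*m2/(r*r):
# one division node, two multiplication nodes, and four invariant leaves.
print(complexity(("/", ("*", "m1", "m2"), ("*", "r", "r"))))  # -> 7
```

The printed value of 7 is consistent with the complexity reported in the text for the gravity form without the constant k.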
The produced expressions become statements, true or false, when they are interpreted as being quantified over some domain. The program will continue to generate more and more statements until it reaches a program-defined stopping point: this stopping point may be, variously, when the program can no longer make a better conjecture (in a well-defined sense) or when some timing condition has been reached. Each produced expression is tested for truth with respect to the examples known to the program. This is the truth test. Secondly, each produced statement must also be non-dominant. A statement is non-dominant if there is at least one object known to the program where the statement gives a better bound for the object than any of the previously accepted statements. For instance, suppose the program is generating upper bounds for the independence number α of a graph, knows graphs G_1, ..., G_k, and is considering a potential conjecture C of the form α ≤ β. If C is true for all the graphs known to the program, and if there is some graph G with β(G) less than the bounds given by all the other previously accepted conjectures, then C is non-dominant and is accepted and added to the store of conjectures. This non-dominance test is Siemion Fajtlowicz's Dalmatian heuristic (Fajtlowicz 1995) and has been used to produce a large number of conjectures that have been investigated by mathematical researchers and have led to publications.

We now describe the Dalmatian heuristic more formally. Let O_1, ..., O_n be examples of objects of a given type. Let α_1, ..., α_k be real number invariants. And let α be an invariant for which conjectured upper or lower bounds are of interest. We form a stream of algebraic functions of the invariants: α_1 + α_2, √α_1, α_1 α_3, (α_2 + α_4)^2, etc. These expressions can then be used to form conjectured bounds for α. If we are interested in upper bounds for α, we can form the inequalities α ≤ α_1 + α_2, α ≤ √α_1, α ≤ α_1 α_3, α ≤ (α_2 + α_4)^2, etc. These inequalities can be interpreted as being true for all the objects of the given type.

The general approach to generating conjectures is as follows.
1. Produce a stream of inequalities with evaluable functions of the invariants on each side of the inequality symbol. Some of these will pass the truth and non-dominance tests and be stored as conjectures.
2. Initialize an initial collection of objects. These can be as few as one.
3. Generate conjectures that are true for all stored objects and non-dominant with respect to these objects and the previously stored conjectures. Pass each generated statement to the truth and non-dominance tests.

The program needs a stopping condition. The best case is that, for each object, there is at least one conjecture that gives the exact value for the object. In this case there is no possibility of improving the current conjectures, in the sense that no other conjectures can make better predictions about the values of the existing objects; exact predictive power for all objects has been achieved. In the case where this natural stopping condition is never attained, another stopping condition will be required. One possibility is to simply stop making conjectures after some hardcoded or user-specified time.
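The truth and non-dominance tests can be sketched directly. The code below filters candidate upper-bound functions over a toy set of gravity-style examples; it is our own simplification of the heuristic as described above, not the actual conjecturing implementation.

```python
# Keep a candidate upper bound only if it holds on every example (truth test)
# and beats every previously accepted bound on at least one example
# (non-dominance test, i.e., the Dalmatian heuristic).
def dalmatian_filter(candidates, examples, target):
    accepted = []
    for cand in candidates:
        values = [cand(ex) for ex in examples]
        # Truth test: the bound must hold on every known example.
        if any(v < target(ex) for v, ex in zip(values, examples)):
            continue
        # Non-dominance test: strictly better than all accepted bounds somewhere.
        def best_existing(ex):
            return min((a(ex) for a in accepted), default=float("inf"))
        if any(v < best_existing(ex) for v, ex in zip(values, examples)):
            accepted.append(cand)
    return accepted

# Toy usage: examples are (m1, m2, r, F) tuples from the gravity illustration.
examples = [(2.0, 3.0, 1.0, 0.057098 * 6.0), (5.0, 4.0, 2.0, 0.057098 * 5.0)]
target = lambda ex: ex[3]
candidates = [
    lambda ex: ex[0] * ex[1] / ex[2] ** 2,   # m1*m2/r^2: true and non-dominated
    lambda ex: ex[0] + ex[1] + ex[2],        # a looser bound, dominated here
]
print(len(dalmatian_filter(candidates, examples, target)))  # -> 1
```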
References
Strong Optimal Classification Trees
Discovering Classification Rules for Interpretable Learning with Linear Programming
Vadalog: A Modern Architecture for Automated Reasoning with Large Knowledge Graphs
Automated Conjecturing II: Chomp and Intelligent Game Play
Data Smashing: Uncovering Lurking Order in Data
On Hamilton's Ideals
A Note on Hamiltonian Circuits
Boolean Decision Rules via Column Generation
Self-Explaining AI as an Alternative to Interpretable AI
On Conjectures of Graffiti
On Cognitive Preferences and the Plausibility of Rule-Based Models
Prognosis of COVID-19: Red Cell Distribution, Platelet Distribution Width, and C-Reactive Protein
On Some Problems of Lovász Concerning the Shannon Capacity of a Graph
Logical Analysis of Data - An Overview: From Combinatorial Optimization to Medical Applications
Decreased CO2 Levels as Indicators of Possible Mechanical Ventilation-Induced Hyperventilation in COVID-19 Patients: A Retrospective Analysis
Jantzen B (2020) Dynamical Kinds and Their Discovery
Deep Feature Synthesis: Towards Automating Data Science Endeavors
ExploreKit: Automatic Feature Generation and Selection
Feature Engineering for Predictive Modeling Using Reinforcement Learning
Scientific Discovery: Computational Explorations of the Creative Process
Scientific Discovery, Causal Explanation, and Process Model Induction
Automated Conjecturing I: Fajtlowicz's Dalmatian Heuristic Revisited
Automated Conjecturing III: Property-Relations Conjectures
On the Shannon Capacity of a Graph (Information Theory)
Good Explanation for Algorithmic Transparency
Choosing Function Sets with Better Generalisation Performance for Symbolic Regression Models
Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead
Learning Customized and Optimized Lists of Rules with Mathematical Programming
Towards Explainable Artificial Intelligence (Explainable AI: Interpreting, Explaining and Visualizing Deep Learning)
Distilling Free-Form Natural Laws from Experimental Data
Combinatorial Optimization: Polyhedra and Efficiency
Random Generalized Linear Model: A Highly Accurate and Interpretable Ensemble Predictor
The Imperative of Interpretable Machines
Discovering de novo Peptide Substrates for Enzymes Using Machine Learning
Regression Shrinkage and Selection via the LASSO
Detecting Statistical Interactions from Neural Network Weights
Neural Interaction Transparency (NIT): Disentangling Learned Interactions for Improved Interpretability (The Thirty-Second Conference on Neural Information Processing Systems)
How Does This Interaction Affect Me? Interpretable Attribution for Feature Interactions
AI Feynman: A Physics-Inspired Method for Symbolic Regression
Learning Optimal Classification Trees Using a Binary Linear Program Formulation
Falling Rule Lists
A Bayesian Framework for Learning Rule Sets for Interpretable Classification
SARS-CoV-2 Causes a Specific Dysfunction of the Kidney Proximal Tubule
Introduction to Graph Theory

Core Facility at Virginia Commonwealth University (https://chipc.vcu.edu) were used for conducting the research reported in this work.