key: cord-0045818-00tal94i authors: Wieczorek, Wojciech; Strąk, Łukasz; Nowakowski, Arkadiusz; Unold, Olgierd title: Grammatical Inference by Answer Set Programming date: 2020-05-23 journal: Computational Science - ICCS 2020 DOI: 10.1007/978-3-030-50423-6_4 sha: 4d5c137435acb3275d289724597382822ce67e54 doc_id: 45818 cord_uid: 00tal94i

In this paper, the identification of context-free grammars based on the presentation of samples is investigated. The main idea of solving this problem proposed in the literature is reformulated in two different ways: in terms of general constraints and as an answer set program. In a series of experiments, we showed that our answer set programming approach is much faster than our alternative method and the original SAT encoding method. Similarly to the pioneering work, some well-known context-free grammars have been induced correctly, and we also followed its test procedure with randomly generated grammars, making it clear that using our answer set programs increases computational efficiency. The research can be regarded as further evidence that solutions based on the stable model (answer set) semantics of logic programming may be a good choice for complex problems.

In grammatical inference [9], a learning algorithm La takes a finite sequence of examples (usually strings) as input and outputs a language description (usually a grammar). There are two main types of presentations: (i) a text for a language L is an infinite sequence of strings x_1, x_2, . . . from L such that every string of L occurs at least once in the text; (ii) an informant for a language L is an infinite sequence of pairs (x_1, d_1), (x_2, d_2), . . . in Σ* × B such that every string of Σ* occurs at least once in the sequence and d_i = true ⟺ x_i ∈ L. The inference algorithms that use type (ii) information are said to learn from positive and negative examples. From Gold's results [7], we know that the class of context-free languages (and even the class of regular languages) cannot be identified from presentation (i), but can be identified using presentation (ii). However, de la Higuera [8] showed that this is computationally hard.

In this work, the following informant learning environment is exploited. Suppose that the inferring process is based on the existence of an Oracle, which can be seen as a device that: 1. knows the language and has to answer correctly; 2. can answer equivalence queries. These are made by proposing some hypothesis to the Oracle; the hypothesis is a grammar representing the unknown language. The Oracle answers Yes in the positive case. In the negative case, the Oracle has to return the shortest string in the symmetric difference between the target language and the language of the submitted hypothesis. Then the following procedure can be applied. Start from a small sample S and k = 1, where the parameter k denotes the number of non-terminal symbols in the target grammar. Run an answer set program (or another exact method). Every time it turns out that there is no solution satisfying all of the constraints, increase k by 1. As long as the Oracle returns a pair (x, d) in response to an equivalence query, add (x, d) to S and run the answer set program (or, respectively, the other exact method) again. Stop after the answer is Yes.
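A minimal Python sketch of this loop follows; the helpers synthesize(S, k) (returning a grammar consistent with the sample S using k non-terminals, or None when the constraints are unsatisfiable) and oracle(G) (returning None for the answer Yes, and otherwise the shortest misclassified pair (x, d)) are hypothetical names, not functions from the paper's implementation.

```python
def infer(S, synthesize, oracle):
    """Informant learning loop; synthesize and oracle are assumed helpers."""
    k = 1
    while True:
        G = synthesize(S, k)       # run the ASP program (or another exact method)
        if G is None:              # no k-variable grammar is consistent with S
            k += 1
            continue
        answer = oracle(G)         # equivalence query
        if answer is None:         # the Oracle answered Yes
            return G
        S.append(answer)           # add the returned pair (x, d) to S
```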
Unfortunately, there is no guarantee that the procedure will terminate in a polynomial number of steps, even when the target language is regular [1]. The equivalence checking may be done by random sampling. The positive answer could then be incorrect, but the probability of this decreases as the sampling is repeated. A very similar procedure for the induction of context-free grammars was proposed by Imada and Nakamura [11]. However, for the exact search for a k-variable grammar, they used Boolean formulas and applied a SAT solver. We took over their main Boolean variables, treating them as predicates, and then constructed a new encoding founded on answer set programming. In an alternative approach, we used the general constraints of Gurobi Optimizer instead of ASP.

The work most closely related to our CFG identification task is that of Imada and Nakamura [11]. They proposed a way to synthesize CFGs from positive and negative samples based on solving a Boolean satisfiability problem (SAT). They translated the learning problem for a CFG into a SAT instance, which is then solved by a SAT solver. A satisfying assignment found by the solver encodes a minimal set of rules (this can easily be changed to a minimal set of variables) that derives all positive samples and no negative samples. They used one derivation constraint and two main types of Boolean variables:

Derivation variables. A set of derivation variables represents a relation between nonterminal symbols and substrings (in other words, a derivation or parse tree) of each (positive or negative) sample w as follows: for any substring x of w and p ∈ V, the derivation variable T^p_x represents the fact that the nonterminal p derives the string x.

Rule variables. A set of rule variables represents a rule set as follows: for any p, q, r ∈ V and a ∈ Σ, a variable R^p_{qr} (or R^p_a) determines whether the production rule p → q r (or p → a) is a member of the set of rules or not.

The derivation constraint is the following set of Boolean expressions, for any string a_1 · · · a_n (n > 1) and nonterminal p ∈ V:

T^p_{a_1···a_n} ⟷ ⋁_{i=1}^{n−1} ⋁_{q,r ∈ V} ( R^p_{qr} ∧ T^q_{a_1···a_i} ∧ T^r_{a_{i+1}···a_n} )
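This constraint mirrors the CYK recognition recurrence. As a standalone illustration (standard textbook code, not taken from [11]), a CYK membership test for a grammar in Chomsky normal form can be written as follows; the sets T[i][j] play the role of the derivation variables T^p_x:

```python
def cyk_accepts(word, binary_rules, terminal_rules, start="S"):
    """Standard CYK recognizer for a CNF grammar (assumes a non-empty word).

    binary_rules: set of (p, q, r) triples encoding rules p -> q r
    terminal_rules: set of (p, a) pairs encoding rules p -> a
    """
    n = len(word)
    # T[i][j] = set of nonterminals deriving the factor word[i:i+j]
    T = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, a in enumerate(word):
        T[i][1] = {p for (p, b) in terminal_rules if b == a}
    for j in range(2, n + 1):          # factor length
        for i in range(n - j + 1):     # factor start position
            for m in range(1, j):      # split point within the factor
                for (p, q, r) in binary_rules:
                    if q in T[i][m] and r in T[i + m][j - m]:
                        T[i][j].add(p)
    return start in T[0][n]
```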
Nakamura et al. have been working on another approach for the incremental learning of CFGs, implemented in the Synapse system [15]. This approach is based on rule generation by analyzing the results of bottom-up parsing of positive samples and on searching for rule sets. Their system can also learn similar CFGs, but it does so only from positive samples. Both methods synthesized similar rule sets for each language in their experiments. They reported that the computation time of the SAT-based approach is shorter than that of Synapse for most languages.

The purpose of the present proposal is to investigate to what extent the power of an ASP solver makes it possible to tackle the context-free inference problem for large-size instances, and to compare our approach with the original one. To allow future comparisons with other methods, the Python implementation of our winning method is given via GitLab. The main original scientific contributions are as follows:

- the formulation of the induction of a k-variable context-free grammar in terms of logical rules with answer set semantics;
- the formulation of the induction of a k-variable context-free grammar in terms of general constraints;
- the construction of an informant learning algorithm based on ASP, CSP, and SAT solvers;
- the conduct of an appropriate statistical test in order to determine the fastest CFG inference method.

This paper is organized into five sections. In Sect. 2, we present necessary definitions and facts originating from formal languages and declarative problem-solving. Section 3 describes our inference algorithms: (a) based on solving an answer set program, and (b) based on solving a constraint satisfaction program, including general constraints such as AND/OR. Section 4 shows the experimental results of our approaches in comparison with the original one. Concluding comments are made in Sect. 5.

We assume the reader to be familiar with basic context-free language theory, e.g., from [10], so we introduce only the notation and notions used later in the paper. An alphabet is a finite, non-empty set of symbols. We use the symbol Σ for the alphabet. A word is a finite sequence of symbols chosen from the alphabet. We denote the length of a word w by |w|. The empty word ε is the word with zero occurrences of symbols. Let x and y be words. Then xy denotes the catenation of x and y, that is, the word formed by making a copy of x and following it by a copy of y. As usual, Σ* denotes the set of words over Σ. A factor of a word w is a prefix of a suffix of w, i.e., a word x such that w = uxv for some words u and v. A set of words, all of which are chosen from some Σ*, where Σ is a particular alphabet, is called a language.

A context-free grammar (CFG) is defined by a quadruple G = (V, Σ, P, v_0), where V is an alphabet of variables (sometimes called non-terminal symbols), Σ is an alphabet of terminal symbols such that V ∩ Σ = ∅, P is a finite set of production rules of the form A → α for A ∈ V and α ∈ (V ∪ Σ)*, and v_0 ∈ V is a special non-terminal symbol called the start symbol. Words over V ∪ Σ are called sentential forms. If A → β is a rule of P and x, y are sentential forms, then we write xAy ⇒ xβy; we call this rewriting a derivation. For any two sentential forms x and y, we write x ⇒* y if there exists a sequence x = x_0, x_1, x_2, . . . , x_n = y of sentential forms such that x_i ⇒ x_{i+1} for all i = 0, 1, . . . , n − 1. The language L(G) generated by G is the set of all words over Σ that are generated by G; that is, L(G) = {w ∈ Σ* : v_0 ⇒* w}. A language is called a context-free language if it is generated by some context-free grammar. Assume that G is the unknown (target) CFG to be identified. An example (a positive word) of G is a word in L(G), and a counter-example (a negative word) of G is a word not in L(G).

A normal form for context-free grammars is a form such that any grammar can be converted to an equivalent grammar in that form. Amongst all normal forms for context-free grammars, the most useful and best-known one is the Chomsky normal form (CNF). A grammar is said to be in Chomsky normal form if each of its rules is in one of two possible forms: A → BC or A → a, where A, B, C ∈ V and a ∈ Σ.
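For example (a standard textbook grammar, not taken from the paper), the language {a^n b^n : n ≥ 1} is generated by the CNF grammar with rules S → AX, S → AB, X → SB, A → a, B → b. The word aabb is derived as

S ⇒ AX ⇒ aX ⇒ aSB ⇒ aABB ⇒ aaBB ⇒ aabB ⇒ aabb,

where each step rewrites exactly one non-terminal using one of the five rules.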
We will briefly introduce the idea of answer set programming (ASP). Those interested in a more detailed description of the topic, alternative definitions, and the formal specification of this kind of logic programming are referred to the handbooks [3, 6], and [12]. A variable or constant is a term. An atom is a(t_1, . . . , t_n), where a is a predicate of arity n and t_1, . . . , t_n are terms. A literal is either a positive literal p or a negative literal ¬p, where p is an atom. A rule r is a clause of the form

a_0 ← a_1 ∧ · · · ∧ a_k ∧ ¬a_{k+1} ∧ · · · ∧ ¬a_m

where a_0, . . . , a_m are atoms. The atom a_0 is the head of r, while the conjunction a_1 ∧ · · · ∧ a_k ∧ ¬a_{k+1} ∧ · · · ∧ ¬a_m is the body of r. By H(r) we denote the head atom, and by B(r) the set {a_1, . . . , a_k, ¬a_{k+1}, . . . , ¬a_m} of body literals. B+(r) (B−(r), resp.) denotes the set of atoms occurring positively (negatively, resp.) in B(r). A program (also called an ASP program) is a finite set of rules. A ¬-free program is called positive. A term, atom, literal, rule, or program is ground if no variables appear in it.

Let P be a program. Let r be a rule in P; a ground instance of r is a rule obtained from r by replacing every variable X in r by a constant occurring in P. We denote the set of all ground instances of the rules occurring in P by ground(P). An interpretation I for P is a set of ground atoms. A ground positive literal p is true w.r.t. I if p ∈ I, and a ground negative literal ¬p is true w.r.t. I if p ∉ I. Let r be a ground rule in ground(P). The head of r is true w.r.t. I if H(r) ∈ I; the body of r is true w.r.t. I if every literal of B(r) is true w.r.t. I; and r is true w.r.t. I if its head is true w.r.t. I whenever its body is true w.r.t. I. A model for P is an interpretation M for P such that every rule r ∈ ground(P) is true w.r.t. M. Given a program P and an interpretation I, the reduct P^I is the set of positive rules defined as follows: P^I = {H(r) ← B+(r) : r ∈ ground(P), B−(r) ∩ I = ∅}. I is an answer set of P if I is the ⊆-smallest model of P^I.

Over the last years, answer set programming has emerged as a declarative problem-solving paradigm. It is a programming methodology rooted in research on artificial intelligence and computational logic, and researchers use it in many areas of science and technology. For our experiments we took advantage of clingo, one of the most efficient and most widely used answer set programming systems available today. In addition to standard definitions, clingo allows one to define constraints, i.e., rules with an empty head, for instance

← a(t).

By adding this constraint to a program, we eliminate its answer sets that contain a(t). Adding the 'opposite' constraint, ← ¬a(t), eliminates those answer sets that do not contain a(t). A constraint can be translated into a normal rule. To this end, the constraint is mapped onto the rule

x ← a_1 ∧ · · · ∧ a_k ∧ ¬a_{k+1} ∧ · · · ∧ ¬a_m ∧ ¬x (6)

where x is a new atom.

Example. Suppose we have three numbered urns and two distinguishable balls, q and r. Every ball has been put into an urn, possibly the same one. An ASP program coding this knowledge consists of rules (7)-(15). Please notice that, as usual in logic programming, identifiers with initial uppercase letters denote variables. Rules (7)-(11) are the simple facts concerning urns and balls: urn(1), urn(2), urn(3), ball(q), ball(r). Rules (12) and (13) define predicates that tell whether a ball is inside a particular urn. The inequality U ≠ V is only used during grounding to eliminate some ground instances of rule (13). It is worth mentioning that grounding systems do not make unnecessary replacements, for example, 1 for U. Rules (14) and (15) ensure that every ball is in exactly one urn. Suppose now that we have discovered that urn 2 is empty and we want to know the possible configurations. It is enough to add two facts stating that neither ball is in urn 2, and find all answer sets. A possible answer set is

ball(q), ball(r), urn(1), urn(2), urn(3), contains(1, q), contains(1, r), in(q), in(r), not_in(2, q), not_in(2, r), not_in(3, q), not_in(3, r),

which describes the placement of both balls into the first urn. clingo also allows using choice constructions, for instance

2 { p(U, B) : urn(U) } 2 ← ball(B)

which describes all possible ways to choose which two of the atoms p(1, q), p(2, q), p(3, q) and which two of the atoms p(1, r), p(2, r), p(3, r) are included in the resultant model. Before and after an expression in braces, we can put integers, which express bounds on the cardinality of the stable models described by the rule. The number on the left is the lower bound (0 is the default), and the number on the right is the upper bound (unbounded is the default).

Our translation converts CFG identification into an ASP program (the main approach) and a CSP model (an alternative approach, based on constraint satisfaction). Suppose we are given a sample composed of examples, S+, and counter-examples, S−, over an alphabet Σ, and a positive integer k. We want to find a k-variable CFG G = (V, Σ, P, v_0) such that S+ ⊆ L(G) and S− ∩ L(G) = ∅. The rules of our program fall into three groups, outlined after the following sketch.
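As an illustration only (this is not the paper's actual program; all predicate names here, such as rule2, rule1, derive, char, len, pos, and neg, are assumptions, with rule2 and rule1 playing roles analogous to the atoms y(i, j, l) and z(i, a) mentioned in the experiments), a self-contained clingo encoding in this spirit, run through clingo's Python API, might look like:

```python
import clingo  # pip install clingo

ENCODING = """
#const k = 2.
nt(0..k-1).
symbol(A) :- char(_, _, A).

% Guess a subset of CNF productions: P -> Q R and P -> a.
{ rule2(P, Q, R) : nt(P), nt(Q), nt(R) }.
{ rule1(P, A) : nt(P), symbol(A) }.

% derive(P, W, I, J): nonterminal P derives positions I..J-1 of word W.
derive(P, W, I, I+1) :- rule1(P, A), char(W, I, A).
derive(P, W, I, J)   :- rule2(P, Q, R), derive(Q, W, I, M), derive(R, W, M, J).

accepts(W) :- derive(0, W, 0, N), len(W, N).
:- pos(W), not accepts(W).   % every example must be accepted
:- neg(W), accepts(W).       % no counter-example may be accepted

#show rule2/3.
#show rule1/2.
"""

# A toy sample: ab is a positive word, aa is a negative word.
SAMPLE = """
pos(w1). char(w1, 0, a). char(w1, 1, b). len(w1, 2).
neg(w2). char(w2, 0, a). char(w2, 1, a). len(w2, 2).
"""

ctl = clingo.Control(["1"])          # stop after the first answer set
ctl.add("base", [], ENCODING + SAMPLE)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print(m))
```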
1. The first rules represent the sample words and their factors. 2. The next rules ensure that in a grammar G a factor can or cannot be derived from a specific variable, and that the grammar contains a subset of all possible productions. 3. All examples should be accepted, and no counter-example can be accepted. In the CSP formulation, instead of predicates, w, y, and z are binary variables. We use the constraints

w_{0s} = 1 for all s ∈ S+ (34)

and

w_{0s} = 0 for all s ∈ S−. (35)

In this section, we describe some experiments comparing the performance of our approaches implemented in Python, using clingo (ASP) and using Gurobi Optimizer (CSP), with our implementation of the algorithm of Imada and Nakamura [11] using the PicoSAT solver (SAT), when positive and negative words are given. For these experiments, we use a set of 40 samples: partly based on randomly generated grammars (33 samples) and partly based on the set of fundamental CFGs appearing in grammatical inference research (the last 7 samples). For testing the learning power for general CFGs, we randomly generated 33 CFGs and, for each of them, prepared positive and negative samples by exhaustively enumerating all words of length at most 14. The grammars are in Chomsky normal form, with 6 to 12 rules over the alphabet {a, b}. In every sample, positive words constitute at least 20% of the total. The last seven samples likewise enumerate all words of length at most 14, but they were generated from descriptions of fundamental context-free languages (among them, the set of strings with equal numbers of a's and b's and the set of strings over {a, b} not of the form ww). In all experiments, we used an Intel Xeon CPU E5-2650 v2, 2.6 GHz (a single core out of eight), under the Ubuntu 18.04 operating system with 60 GB of available RAM.

Algorithm 1 shows the process of synthesizing a grammar (the set of production rules, with v_0 always being the start symbol) from positive and negative words. In the algorithm, S+ and S− represent the sets of positive and negative words given as input, while the variables S′+ and S′− hold the sets of samples to be covered in the next loop iteration.

Algorithm 1.
Require: S+ positive words, S− negative words, Σ a set of terminal symbols
Ensure: G a context-free grammar consistent with S+ and S−
  S′+ ← {the shortest word from S+}; S′− ← ∅; k ← 1
  loop
    R ← Convert(S′+, S′−, Σ, k)
    if a model M of R is found then
      G ← Extract(M)
      X ← S+ \ L(G); Y ← S− ∩ L(G)
      if X ∪ Y = ∅ then
        return G
      end if
      add appropriately the shortest word from X ∪ Y to S′+ or to S′−
    else
      k ← k + 1
    end if
  end loop

The algorithm picks a word from S+ or S− that is not treated correctly by the inferred grammar G and adds it to S′+ or S′−. The function Convert translates the problem into a set of ASP rules R (or Gurobi general constraints, or a Boolean expression). If the ASP solver (or Gurobi Optimizer, or the SAT solver) finds a stable model M, the function Extract returns a set of production rules by analyzing the presence of particular y(i, j, l) and z(i, a) atoms. The algorithm repeats this process, increasing k to relax the limit on the number of non-terminals, until G is consistent with all the given words in S+ and S−.

The results are listed in Table 1. In order to determine whether the observed CPU-time differences between the ASP runs and the runs of the remaining methods occurred by chance, we used the Wilcoxon signed-rank test [17, pp. 915-916] for ASP vs SAT and for ASP vs Gurobi. The null hypothesis to be tested is that the median of the paired differences is negative (against the alternative that it is positive). As we can see from Table 2, the p-value is high in both cases, so the null hypothesis cannot be rejected, and we may conclude that using our ASP encoding is likely to improve CPU-time performance for most benchmarks of this kind.
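For reference, a one-sided Wilcoxon signed-rank test of paired CPU times can be computed with SciPy as sketched below; the numbers are hypothetical placeholders, not the paper's measurements, and the exact pairing and hypothesis direction follow the paper's Table 2 setup.

```python
from scipy.stats import wilcoxon

# Hypothetical paired CPU times (seconds) on the same benchmark instances.
sat_times = [1.4, 0.9, 3.1, 2.2, 5.0]
asp_times = [0.6, 0.8, 1.1, 0.9, 2.3]

diffs = [s - a for s, a in zip(sat_times, asp_times)]
# One-sided test: alternative that the differences tend to be positive,
# i.e. that the SAT runs are slower than the ASP runs.
stat, p_value = wilcoxon(diffs, alternative="greater")
print(stat, p_value)
```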
Our induction method can also be applied to other data that are not taken from context-free infinite languages. We tried its classification quality on two bioinformatics datasets: the WALTZ-DB database [4], composed of 116 hexapeptides known to induce amyloidosis (S+) and 161 hexapeptides that do not induce amyloidosis (S−), and the Maurer-Stroh et al. database [14] from the same domain, where the ratio of S+/S− is 240/836. We chose a few standard machine learning methods for comparison: BNB (naive Bayes classifier for multivariate Bernoulli models [13, pp. 234-265]), DTC (decision tree classifier, the CART method [5]), MLP (multi-layer perceptron [16]), and SVM (support vector machine classifier with the linear kernel [18]). In all methods except ASP and BNB, an unsupervised data-driven distributed representation called ProtVec [2] was applied in order to convert words (protein representations) to numerical vectors. For BNB, we represented words as binary-valued feature vectors indicating the presence or absence of every pair of protein letters. In the case of ASP, the training set was partitioned randomly into n parts, and the following process was performed m times: choose one part for synthesizing a CFG and use the remaining n − 1 parts for validating it. The best of all m grammars, in terms of the highest F-measure, was then confronted with the test set. For WALTZ-DB, n and m were both set to 20; for Maurer-Stroh, n was set to 10 and m to 30. These values were selected experimentally, based on the size of the databases and the running time of the ASP solver. To estimate the ability of ASP and the compared approaches to classify unseen hexapeptides, a repeated 10-fold cross-validation (CV) strategy was used. It means splitting the data randomly into 10 mutually exclusive folds, building a model on all but one fold, and evaluating the model on the skipped fold. The procedure was repeated 10 times, and the overall assessment of a model was based on the mean of those 10 individual evaluations. Table 3 summarizes the performances of the compared methods on the WALTZ-DB and Maurer-Stroh databases. It is noticeable that the ASP approach achieved the best F-score for the larger dataset (Maurer-Stroh) and an average F-score for the smaller one (WALTZ-DB); hence it can be used with high reliability to recognize amyloid proteins. BNB is outstanding for WALTZ-DB and almost as good as ASP for the Maurer-Stroh database.

In this paper, we proposed an approach for learning context-free grammars from positive and negative samples by using logic programming. We encode the set of samples, together with a limit on the number of non-terminals to be synthesized, as an answer set program. A stable model (an answer set) of the program contains a set of grammar rules that derives all positive samples and no negative samples. One feature of this approach is that we can synthesize a compact set of rules in Chomsky normal form. Another is that our learning method will benefit from future improvements of ASP solvers. We presented experimental results on learning CFGs for fundamental context-free languages, including the set of strings composed of equal numbers of a's and b's and the set of strings over {a, b} not of the form ww. Another series of experiments on random languages shows that our encoding can speed up computations in comparison with SAT and CSP encodings.
[1] Negative results for equivalence queries
[2] Continuous distributed representation of biological sequences for deep proteomics and genomics
[3] Knowledge Representation, Reasoning, and Declarative Problem Solving
[4] WALTZ-DB: a benchmark database of amyloidogenic hexapeptides
[5] Classification and Regression Trees
[6] Answer Set Solving in Practice
[7] Language identification in the limit
[8] Characteristic sets for polynomial grammatical inference
[9] Grammatical Inference: Learning Automata and Grammars
[10] Introduction to Automata Theory, Languages, and Computation, 2nd edn.
[11] Learning context free grammars by using SAT solvers
[12] Answer Set Programming
[13] Introduction to Information Retrieval
[14] Exploring the sequence determinants of amyloid structure using position-specific scoring matrices
[15] Incremental learning of context free grammars based on bottom-up parsing and search
[16] Scikit-learn: machine learning in Python
[17] Encyclopedia of Research Design
[18] Probability estimates for multi-class classification by pairwise coupling