key: cord-0043189-pkgh1dsk authors: Li, Yeting; Chen, Haiming; Zhang, Lingqi; Huang, Bo; Zhang, Jianzhao title: Inferring Restricted Regular Expressions with Interleaving from Positive and Negative Samples date: 2020-04-17 journal: Advances in Knowledge Discovery and Data Mining DOI: 10.1007/978-3-030-47436-2_58 sha: 956ed73c60311f2f22b12ad5424dc56083a42cdd doc_id: 43189 cord_uid: pkgh1dsk The presence of a schema for XML documents has numerous advantages. Unfortunately, many XML documents in practice are not accompanied by a schema or a valid schema. Therefore, it is essential to devise algorithms to infer schemas. The fundamental task in XML schema inference is to learn regular expressions. In this paper, we focus on learning the subclass of RE(&) called SIREs (a subclass of regular expressions with interleaving). Previous work in this direction lacks inference algorithms that support inference from positive and negative examples. We provide an algorithm to learn SIREs from positive and negative examples based on genetic algorithms and parallel techniques. Our algorithm also has better extensibility: it supports learning not only from positive and negative examples together, but also from positive or negative examples alone. Experimental results demonstrate the effectiveness of our algorithm. A classical problem in grammatical inference is to identify a language from positive examples and negative examples. We study learning regular expressions (REs) with interleaving (shuffle), denoted by RE(&). RE(&) are widely used in various areas of computer science [1], including XML database systems [5,12,26], complex event processing [24], system verification [4,13,15], plan recognition [18] and natural language processing [21,28]. Studying the inference of RE(&) therefore has several practical motivations, such as schema inference.
The presence of a schema for XML documents has many advantages, such as for query processing and optimization, data integration and exchange [11,30]. However, many XML documents in practice are not accompanied by a valid schema [16], making schema inference an attractive research topic [2,3,10,14,31]. Learning Relax NG schemas is an important research problem in schema inference, since Relax NG is more powerful than other XML schema languages such as DTD or XSD [5] and has unrestricted support for the interleaving operator. It is known that the essential task in Relax NG schema inference is learning RE(&) from a given set of samples [23,31]. Previously, RE(&) learning has been studied from positive examples only [23,29,31]. However, negative examples can be useful in some applications. For instance, schema evolution [8,9] can be done incrementally, with little feedback needed from the user, when we also allow negative examples. Learning RE(&) from positive and negative examples may have other crucial applications, such as mining scientific workflows. REs have already been used in the literature as a well-suited mechanism for inter-workflow coordination [17]. The user labels some sequences of modules from a set of available workflows as positive or negative examples, and such algorithms can then be applied to infer the workflow pattern that the user has in mind. Such applications motivate us to investigate the problem of learning RE(&) from positive and negative examples. Most researchers have studied subclasses of REs, which are expressive enough to cover the vast majority of real-world applications [6,7,22] and behave better on several decision problems than general REs [6,7,19,20,25,27]. Bex et al. [3] proposed learning algorithms for two subclasses of REs, SOREs and CHAREs, which capture many practical DTDs/XSDs and are both single occurrence REs. Bex et al.
[2] also studied learning algorithms, based on the Hidden Markov Model, for the subclass of REs in which each alphabet symbol occurs at most k times (k-OREs). More recently, Freydenberger and Kötzing [10] proposed more efficient algorithms for the above-mentioned SOREs and CHAREs. The existing work on RE(&) learning mentioned above [23,29,31] all targets specific subclasses of REs. These approaches infer restricted subclasses of single occurrence REs with interleaving, starting from a positive set of words representing XML documents, based on maximum clique or maximum independent set. In this paper, we focus on learning the subclass of RE(&) called SIREs (see Definition 1) [29]. It has been proved that the problem of learning SIREs is NP-hard [29]. Here, we solve this problem using genetic algorithms and parallel techniques: genetic algorithms have been used to solve NP-hard problems, and parallel techniques make the program more efficient. As a result, when given both positive and negative examples, we can effectively learn a SIRE. The main contributions of this paper are listed as follows. - Regular Expression with Interleaving. Let Σ be a finite alphabet of symbols. The set of all words over Σ is denoted by Σ*. The empty word is denoted by ε. A RE with interleaving over Σ is defined inductively as follows: ε or a ∈ Σ is a RE; for REs r1 and r2, the disjunction r1|r2, the concatenation r1·r2, the interleaving r1&r2, and the Kleene star r1* are also REs. r? and r+ are abbreviations of r|ε and r·r*, respectively. These expressions are denoted as RE(&). The size of a RE r, denoted by |r|, is the total number of symbols and operators occurring in r. The language L(r) of a RE r is defined inductively; for the interleaving operator, let u = au′ and v = bv′ where a, b ∈ Σ and u′, v′ ∈ Σ*; then u&ε = ε&u = {u} and u&v = a(u′&v) ∪ b(u&v′). For example, L(ab&cd) = {cdab, cadb, cabd, acdb, acbd, abcd}. A RE with interleaving r is a SOIRE if every alphabet symbol occurs at most once in r.
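The interleaving semantics above can be sketched directly in Python (the function name is ours, not from the paper; it implements the recursive definition u&v = a(u′&v) ∪ b(u&v′)):

```python
def interleavings(u, v):
    """All words in u & v: merges of u and v that preserve the
    relative order of the symbols within each word."""
    if not u:
        return {v}
    if not v:
        return {u}
    # u&v = a(u'&v) ∪ b(u&v') for u = a·u', v = b·v'
    return ({u[0] + w for w in interleavings(u[1:], v)} |
            {v[0] + w for w in interleavings(u, v[1:])})

print(sorted(interleavings("ab", "cd")))
# ['abcd', 'acbd', 'acdb', 'cabd', 'cadb', 'cdab']
```

This reproduces the six words of L(ab&cd) listed above.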
We consider the subclass of REs with interleaving (SIREs) defined by the following grammar. For instance, a*b?&cd+ is a SIRE, but a+b&c+a is not, because a appears twice. We use candidate regions (CRs) to define the skeleton structure of a SIRE. Let N = {0, 1, 2, . . .}. For a given alphabet with |Σ| = n, it is easy to see that there are 2^(n-1) CRs. For example, consider Σ = {a, b, c, d, e} with |Σ| = 5. As shown in Fig. 1, we get 16 CRs. The number of squares with the same color represents |Di|; e.g., the 6th CR denotes 1&1&3 and the 12th CR denotes 1&1&1&2. So the SIRE r1 = a+&b&c*d+e? belongs to the 6th CR 1&1&3, and the SIRE r2 = a+&b&c*&d+e? belongs to the 12th CR 1&1&1&2. Our algorithm aims to obtain an accurate and precise SIRE, which should accept as many positive samples as possible and reject as many negative samples as possible. We present the major technical details of our algorithm in this section. The main algorithm is presented in Sect. 3.1. Initializing all the simplified candidate regions (SCRs) is introduced in Sect. 3.2. Selecting the best candidate SIRE from each SCR is given in Sect. 3.3. The algorithm iSIRE first computes the SCRs of the expression to be learned; then, for each SCR, it employs genetic algorithms to learn a character sequence and a multiplicity sequence in parallel, and decodes each learned sequence into a SIRE according to its SCR. After multi-generation evolution and iteration, the best SIRE is selected by function bestRE(). The main procedures of the algorithm are presented in Algorithm 1 and are illustrated as follows. - Scan the positive examples S+ and negative examples S− to get the alphabet Σ, then call function getSCRs() to initialize all the SCRs based on |Σ|. - In parallel, call algorithm candSIRE to select the best SIRE from each SCR, and put the results in the candidate set C. - Call function bestRE() to select the best SIRE from C and output it. Function bestRE() is designed to select the best SIRE.
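A CR is just an ordered way of splitting the n alphabet symbols into non-empty groups, i.e. a composition of n; a minimal sketch enumerating them (the function name is ours, not from the paper):

```python
from itertools import combinations

def candidate_regions(n):
    """Enumerate the CRs for an alphabet of size n: every ordered split
    of n symbols into non-empty groups D1 & ... & Dk. Choosing which of
    the n-1 gaps between symbols become '&' separators yields exactly
    2**(n-1) compositions."""
    regions = []
    for k in range(n):  # k separators -> k+1 groups
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            regions.append(tuple(bounds[i + 1] - bounds[i]
                                 for i in range(len(bounds) - 1)))
    return regions

crs = candidate_regions(5)
print(len(crs))          # 16, matching Fig. 1
print((1, 1, 3) in crs)  # the CR of a+&b&c*d+e?
```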
It measures two metrics of a SIRE r: K(r) for accuracy and CC(r) [23] for preciseness. Note that K(r) has a higher priority than CC(r) when selecting the best SIRE. A larger K(r) means that r accepts more positive examples and rejects more negative examples; the smaller CC(r) is, the more precise the SIRE. In the rest of this section, we discuss the implementations of lines 3, 4 and 5 in detail. We first give a detailed explanation of line 3 of Algorithm 1 (initializing all the SCRs). From Definition 2, when |Σ| = n, there are 2^(n-1) CRs. Because of the unordered nature of SIREs, for a SIRE r = D1& · · · &Dn the order of the Di can be arbitrary, where 1 ≤ i ≤ n. Hence, we can merge equivalent CRs to obtain the SCRs. For instance, in Fig. 1, we can merge the 6th CR 1&1&3, the 8th CR 1&3&1 and the 11th CR 3&1&1 together. After merging all equivalent CRs, we get the SCRs shown in Fig. 2. When |Σ| = n, how many SCRs are there? This problem is equivalent to Integer Partition; e.g., when |Σ| = 5, there are 7 SCRs: 5, 4&1, 3&2, 3&1&1, 2&2&1, 2&1&1&1 and 1&1&1&1&1. Correspondingly, the 7 partitions of 5 are: 5 = 5, 5 = 4 + 1, 5 = 3 + 2, 5 = 3 + 1 + 1, 5 = 2 + 2 + 1, 5 = 2 + 1 + 1 + 1 and 5 = 1 + 1 + 1 + 1 + 1. In general, approximation formulas exist for the number of partitions: for n ∈ N, the number of partitions of n satisfies p(n) ≈ (1/(4n√3)) e^(π√(2n/3)) (Hardy-Ramanujan), which is far less than 2^(n-1). Table 1 shows intuitively that the number of SCRs is far less than the number of CRs. So we use function getSCRs() to get SCRs instead of CRs in our algorithm iSIRE. For each SCR obtained in the first step, we employ the algorithm candSIRE (shown in Algorithm 2) to find the best candidate SIRE. Because the SCRs are independent of each other, we use multiple threads on our multi-core processor to run the candSIRE algorithm in parallel.
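The SCR enumeration above (integer partitions with parts in non-increasing order) can be sketched as follows; the function name is ours, and the comparison at the end illustrates how much smaller the SCR space is than the CR space:

```python
def scrs(n, max_part=None):
    """Enumerate the SCRs for |Σ| = n as the integer partitions of n,
    with parts listed in non-increasing order."""
    if max_part is None:
        max_part = n
    if n == 0:
        return [[]]
    result = []
    for first in range(min(n, max_part), 0, -1):
        for rest in scrs(n - first, first):  # later parts never exceed 'first'
            result.append([first] + rest)
    return result

print(scrs(5))
# [[5], [4, 1], [3, 2], [3, 1, 1], [2, 2, 1], [2, 1, 1, 1], [1, 1, 1, 1, 1]]
print(len(scrs(15)), 2 ** 14)  # p(15) = 176 SCRs vs. 16384 CRs
```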
By using parallel processing, we can infer the best candidate SIRE for each SCR simultaneously, which makes a substantial difference, since there are often hundreds of SIREs to evaluate per SCR. The algorithm candSIRE uses a number of genetic operators. Using the alphabet Σ = {a, b, c, d, e} as an example, we introduce some of them as follows. - character crossover: in the character population, we randomly select two character sequences p1 = ebdac and p2 = eabdc as parents (Fig. 3). First, we select the genetic information (bd) and (ab) of the parents p1 and p2 at the same positions, and put them into the children c1 and c2 at the corresponding positions, respectively. Then we explain how to add the genetic information of parent p2 into child c1. Our method starts from the ending position of the genetic information of p2, and passes each gene of p2 that has not yet appeared in c1 on to c1. In this example, we start with the gene d of p2; because d is already in c1, we skip it and move to c. Since c is not in c1, we put c into the first available location of c1. Having arrived at the end of p2, we wrap around to its first gene e. This time, e is not in c1, so we add e to the next available location of c1. Continuing this process, we get c1 = cbdea. In the same way, we generate c2 = cabed; both are shown in Fig. 3. - character mutation: the principle of character mutation is to traverse each gene and decide whether it mutates according to the mutation rate. If a selected gene mutates, the method randomly selects another gene and exchanges their positions. For example, for a character sequence p1 = abcde, assume the selected gene a mutates; we then select gene e and exchange the positions of a and e, finally getting p1 = ebcda, as shown in Fig. 4. - chromosome encoding: as shown in Fig. 5, when we encode the SIRE r = d?b+c*&e+&a, we extract an SCR 3&1&1, a multiplicity sequence "?+*+1" and a character sequence dbcea. - chromosome decoding: we decode an SCR 2&2&1, a multiplicity sequence "*??+*" and a character sequence abcde to get the SIRE r = a*b?&c?d+&e*. The example is shown in Fig. 6. 1. Initialize the population of candidate character sequences. Here we set the population size to 500. 2. Select the best multiplicity sequence for each character sequence using algorithm selectMuls in parallel. The pseudocode of selectMuls is presented in Algorithm 3; we explain its details later. 3. Decode each pair of character sequence and corresponding best multiplicity sequence, together with the given SCR, to get the population of candidate SIREs by calling function decode(). 4. Call function calcValues() to calculate the fitness value f(r) for each SIRE r. The fitness value of r is defined as f(r) = (K(r), CC(r)). For the detailed definitions of K(r) and CC(r), see Sect. 3.1. Our fitness function gives priority to K(r) and then compares CC(r); that is, on the basis of selecting SIREs that accept more positive examples and reject more negative examples, it then prefers the more precise ones. 5. Call function select() to generate the next generation of SIREs. The method first retains, unchanged, the best 20% of SIREs in the current population by fitness f(r), and then applies roulette-wheel selection to the remaining 80% to obtain the next generation. Note again that K(r) is the top priority when choosing SIREs; when the values of K(r) are the same, we choose the SIRE whose CC(r) is minimal. 6. Call function charCrossover(): select some pairs of character sequences according to the crossover rate (0.8), and construct new pairs of character sequences by applying the character crossover. 7.
Call function charMutate(): select some character sequences according to the mutation rate (0.03), and modify the selected sequences by applying the character mutation. 8. Repeat steps 2-7 until the number of generations reaches the given threshold C_Gmax. Here we set C_Gmax = 300. Finally, we call function bestRE() (see Sect. 3.1) to select the best SIRE from the last generation of SIREs. To improve the efficiency of evolution, we adopt two tricks to optimize algorithm candSIRE. The second step of candSIRE selects the best multiplicity sequence for each character sequence. This process can be executed in parallel, because the character sequences are independent of each other when finding the best multiplicity sequence. Besides, as is well known, the fitness function is usually the most computationally expensive component of a genetic algorithm. Thus, we use value hashing to reduce the time spent on calculating fitness values, by storing previously computed fitness values in a hash table. During execution, previously found solutions are revisited due to the random mutations and recombinations of SIREs; we then retrieve the fitness value directly from the hash table instead of recalculating it. Inevitably, the hash table consumes additional memory. Now we introduce the algorithm selectMuls used in the second step of algorithm candSIRE (shown in Algorithm 3); it aims to select the best multiplicity sequence for each character sequence. Before introducing the details of selectMuls, we illustrate its genetic operators as follows. - multiplicity crossover: randomly select crossover points of the two parents, and then exchange the selected genes to get the children. In the multiplicity population, e.g., we randomly select two multiplicity sequences p1 = "*+1?+" and p2 = "*??++" as parents. Then we exchange the "+1" of p1 and the "??"
of p2 to get the children c1 = "*???+" and c2 = "*+1++". The main procedures of selectMuls are illustrated as follows. 1. Initialize the population of candidate multiplicity sequences. Here we set the population size to 200. 2. Call function decode(): decode each group of multiplicity sequence, character sequence and SCR to get the population of candidate SIREs. 3. Call function calcValues() to calculate the fitness value f(r) for each SIRE r. 4. Call function select(): use roulette-wheel selection to generate the next generation from the current population according to the fitness values. 5. Call function mulCrossover(): select some pairs of multiplicity sequences according to the crossover rate (0.8), and construct new pairs of multiplicity sequences by multiplicity crossover. 6. Call function mulMutate(): select some multiplicity sequences according to the mutation rate (0.03), and modify the sequences by applying the multiplicity mutation. 7. Repeat steps 2-6 until the number of generations reaches the given threshold M_Gmax. Here we set M_Gmax = 100. Then, we call function bestRE() to select the best SIRE r from the last generation of SIREs in the given SCR. Finally, we call function encode(r).muls to get a multiplicity sequence. In this section, we validate our algorithm by means of experimental analysis. All experiments were performed using a prototype implementation of iSIRE written in Python 3.6, executed on a machine with a sixteen-core Intel Xeon E5620 CPU @ 2.4 GHz and 24 GB of memory. To compare the algorithms Exact Minimal [29], conMiner [29], conDAG [29] and iSIRE, we generated 9 datasets of positive examples with alphabet size |Σ| ∈ {5, 10, 15} and example size |S| ∈ {100, 500, 1000}. Table 2 presents the K(r) values and CC(r) values of the learned SIREs.
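The multiplicity crossover described above is a plain two-point exchange; a minimal sketch (the function name is ours, and the crossover positions, random in selectMuls, are fixed here to reproduce the example):

```python
def multiplicity_crossover(p1, p2, i, j):
    """Exchange the genes in positions [i, j) between two
    multiplicity sequences to produce two children."""
    c1 = p1[:i] + p2[i:j] + p1[j:]
    c2 = p2[:i] + p1[i:j] + p2[j:]
    return c1, c2

# The example above: swap "+1" in p1 with "??" in p2.
c1, c2 = multiplicity_crossover("*+1?+", "*??++", 1, 3)
print(c1, c2)  # *???+ *+1++
```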
From Table 2 we can see that for all 9 datasets, the K(r) values of the learned SIREs are 100% for all four algorithms, which means that each algorithm guarantees that the learned SIREs cover all positive examples. According to the CC(r) values of the SIREs learned by the four algorithms in Table 2, we observe that when the alphabet size is small (|Σ| = 5), the SIREs learned by Exact Minimal, conMiner, conDAG and iSIRE have the same small CC(r) values. However, as the alphabet size grows (|Σ| = 15), the CC(r) values of the SIREs learned by Exact Minimal, conMiner and conDAG are much larger than those of iSIRE, which means that the results learned by iSIRE are more precise than those of the other three algorithms. To evaluate the effectiveness of our algorithm when learning from both positive and negative examples, we would have liked to compare iSIRE with other approaches, but this was impossible, since we found no other tools or algorithms supporting learning SIREs from both positive and negative examples. Thus we only conducted experiments with our own algorithm. Varying the alphabet size, the example size, and the proportion of positive and negative examples, we built 27 datasets of examples and conducted experiments on them (shown in Table 3). As Table 3 shows, more than 74% of the K(r) values of the inferred SIREs are above 75%; that is, the majority of SIREs learned by iSIRE accept most of the positive examples and reject most of the negative examples, which demonstrates the high effectiveness of our algorithm. In this paper, we proposed the algorithm iSIRE to learn a SIRE from positive and negative examples based on genetic algorithms and parallel techniques. We conducted experiments with alphabets of different sizes, and the results showed that with only positive examples, our learning results are more precise than those of the state-of-the-art algorithms, and that when given both positive and negative examples, we can learn SIREs with high accuracy.
References
- Shuffled languages - representation and recognition
- Learning deterministic regular expressions for the inference of schemas from XML data
- Inference of concise DTDs from XML data
- Two-variable logic on words with data
- Efficient asymmetric inclusion of regular expressions with interleaving and counting for XML type-checking
- Linear time membership in a class of regular expressions with counting, interleaving, and unordered concatenation
- Update rewriting and integrity constraint maintenance in a schema evolution support system: PRISM++
- Managing semi-structured data
- Fast learning of restricted regular expressions and DTDs
- Schema profiling of document-oriented databases
- W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures
- Concurrent regular expressions and their relationship to Petri nets
- XTRACT: learning document type descriptors from XML document collections
- Shuffle languages, Petri nets, and context-sensitive grammars
- The quality of the XML Web
- Workflow and process synchronization with interaction expressions and graphs
- Weighted unranked tree automata as a framework for plan recognition
- The inclusion problem for regular expressions
- The membership problem for regular expressions with unordered concatenation and numerical constraints
- Treebank grammar techniques for non-projective dependency parsing
- Practical study of deterministic regular expressions from large-scale XML and schema data
- Learning concise Relax NG schemas supporting interleaving from XML documents
- PIE: approximate interleaving event matching over sequences
- Closure properties and descriptional complexity of deterministic regular expressions
- BonXai: combining the simplicity of DTD with the expressiveness of XML schema
- Complexity of decision problems for XML Schemas and chain regular expressions
- Non-projective dependency parsing in expected linear time
- Discovering restricted regular expressions with interleaving
- Schema management for document stores
- Inference of a concise regular expression considering interleaving from XML documents