Data mining for grammatical inference with bioinformatics criteria


Data mining for grammatical inference with bioinformatics criteria

Vivian F. López ⇑, Ramiro Aguilar, Luis Alonso, María N. Moreno
Departament Informática y Automática, University of Salamanca, Plaza de la Merced S/N, 37008 Salamanca, Spain

a r t i c l e i n f o

Keywords:
Grammatical inference
Bioinformatic
Free Context Grammar
DNA
Sequential patterns

a b s t r a c t

In this work a novel data mining process is described that combines hybrid techniques of association
analysis and classical sequentiation algorithms of genomics, to generate grammatical structures of a spe-
cific language. Subsequently, these structures are converted to Context-Free Grammars. Initially the
method applies to context-free languages with the possibility of being applied to other languages: struc-
tured programming, the language of the book of life expressed in the genome and proteome and even the
natural languages. We used an application of a compilers generator system that allows the development
of a practical application within the area of grammarware, where the concepts of the language analysis
are applied to other disciplines, like bioinformatic. The tool allows measuring the complexity of the
obtained grammar automatically from textual data.

� 2011 Elsevier Ltd. All rights reserved.

1. Introduction

During recent years many approaches were introduced as data
mining methods for pattern recognition in biological database. Bio-
informatics employs computational and data processing technolo-
gies to develop methods, strategies and programs that permit to
handle, order and study the immense quantity of biological data
that have been generated and are currently generated.

To this aim, the computational linguistics has received consid-
erable attention in bioinformatics. The study in Searls et al.
(1999) indicated that a relation exists between formal languages
theory and DNA. Being the linguistic view of DNA sequences, a rich
source of ideas to model strings with correlated symbols. Most of
the work (Jiménez-Montaño, 2009; Jiménez-Montaño, Feistel, &
Diez-Martínez, 2010) has involved examinations of the occur-
rences of ‘‘words’’ in DNA. Searls and Dong (1993) found that such
a linguistic approach proves useful not only in theoretical charac-
terization of certain structural phenomena in sequences, but also
in generalized pattern recognition in this domain, via parsing.
The information represented on sequences involves grammatical
inference for pattern recognition.

In this work a novel data mining process is described that com-
bines hybrid techniques of association analysis and classical
sequentiation algorithms of genomics to generate grammatical
structures of a specific language. Subsequently, these structures
are converted to Context-Free Grammars (CFG). Initially the meth-
od applies to context-free languages with the possibility of being
applied to other languages: structured programming, the language

of the book of life expressed in the genome and proteome and even
the natural languages.

We used an application of the compiler generator so named GAS
1.0 system (López, Sánchez, Alonso, & Moreno, 2009), that repre-
sents an Integrated Development Environment (IDE) which allow
the development of a practical application within the area for the
automatic generation of language-based tools, that starts from
the traditional solutions and facilitates the use of formal language
theory in other disciplines: Grammar-Based Systems (GBSs)
(Mernik, Crepinsek, Kosar, Rebernak, & Zumer, 2004). The tool al-
lows measuring the complexity of the obtained grammar automat-
ically from textual data.

2. Context-Free Grammar

A grammar G is defined like G = (N, T, P, S), where N is the set of
nonterminals symbols, T is the set of terminals symbols, P is the set
of production rules and S is the initial symbol. The language of a
grammar L(G) is the set of all terminal strings w that have deriva-
tions from the initial symbol. This is: L(G) = {w is in T⁄jS ) ⁄w}.

A CFG has production rules like A ? a where A 2 N and
a 2 (N [ T)⁄. The substitution of A by a is carried out independently
of the place in which appear A (Louden, 1997). The majority of the
programming languages are generated by grammars of this type
(enlarged with some contextual elements necessary for the lan-
guage semantics).

2.1. Grammars and bioinformatics

The association analysis involves techniques that are different
in its operations but all of them search relations among the attri-
butes of a data set. Some techniques are:

0957-4174/$ - see front matter � 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2011.08.058

⇑ Corresponding author.
E-mail address: vivian@usal.es (V.F. López).

Expert Systems with Applications 39 (2012) 2330–2334

Contents lists available at SciVerse ScienceDirect

Expert Systems with Applications

j o u r n a l h o m e p a g e : w w w . e l s e v i e r . c o m / l o c a t e / e s w a

http://dx.doi.org/10.1016/j.eswa.2011.08.058
mailto:vivian@usal.es
http://dx.doi.org/10.1016/j.eswa.2011.08.058
http://www.sciencedirect.com/science/journal/09574174
http://www.elsevier.com/locate/eswa


� Association rules
� Discovery of sequential patterns (DSP), and
� Discovery of associations.

The association rules (AR) describe the relations of certain attri-
butes with regard to others attributes in a database (DB). These
rules identify cause-effect implications between the different attri-
butes of the DB. Similar to the AR, the discovery of associations
(DA) tries to find implications between different couples attri-
bute-value so that the appearance of these determine a present
association in a good quantity of the registers of the DB.

Discovery of sequential patterns (DSP) is very similar to the AR
but search for patterns between transactions so that the presence
of a set of items precede another set of items in a DB during a per-
iod of time. For example, if the data correspond to registers of arti-
cles purchased by clients, a description of what articles buys
frequently a client can be obtained, and above all, which is the se-
quence of its purchase. Thus, the next time, the profile of the client
would be known, and it will be able to predict the sequence of its
purchase. This criteria can apply to another data control, for exam-
ple, in the bioinformatics context, when the data to treat corre-
spond to the chain of nucleotides of the genome and sequences
are discovered as the patterns that codify genes conform some pro-
tein (Fayyad, Piatetsky-Shapiro, Smyth, & Uthurusamy, 1996).

Bioinformatics employs hybrid techniques to handle, order and
study the immense quantity of biological data that have been gen-
erated and are currently generated. For example, for the human

genome (HG), the bioinformatics seeks to find meaning to the lan-
guage of the more than 37.000 million peers A, C, T and G that have
been compiled and stored in the book of life.

They offer us the opportunity to understand the gigantic DB
that contain the details of the circumstances of time and place in
which the genes are activated, the conformation of the proteins
that specify, the form in which they influence some proteins on
others and the role that such influences can play in the diseases.
Besides, what are the relations of the HG with the genomes of
the model organisms, like the fly of the fruit, the mice and the bac-
teria? Will it be able to discover sequential patterns that show how
are related between itself the fragments of information? and will it
be able to conform a grammatical structure that show the interpre-
tation of the resultant set? If we are able to infer that structure for
this type of language we will contribute to understand the real
function of the structure of the DNA and we will understand
slightly more than the questions presented. One of the applications
of the bioinformatics is the pharmacology, offering reviving solu-
tions to the old model for the creation of new medicines. It is worth
to note that, one of the more elementary bioinformatics operations
consists of the search of resemblances between a fragment of DNA
recently arranged and the already available segments of diverse
organisms (remember and associate this with the DSP). The finding
of approximate alignments permits to predict the type of protein
that will specify such sequence. This not only provides trails on
pharmacological designs in the initial phases of the development
of medicines, but suppresses some that will constitute un resolving

Fig. 1. Using of the bioinformatics in the pharmacology.

V.F. López et al. / Expert Systems with Applications 39 (2012) 2330–2334 2331


https://isiarticles.com/article/22251