An expert system to identify co-regulated gene groups from time-lagged gene clusters using cell cycle expression data


Expert Systems with Applications 37 (2010) 2202–2213
Contents lists available at ScienceDirect

Expert Systems with Applications

j o u r n a l h o m e p a g e : w w w . e l s e v i e r . c o m / l o c a t e / e s w a
An expert system to identify co-regulated gene groups from time-lagged gene
clusters using cell cycle expression data

Li-Ching Wu a, Jhih-Long Huang b, Jorng-Tzong Horng b,d,*, Hsien-Da Huang c

a Institute of System Biology and Bioinformatics, National Central University, Taiwan
b Department of Computer Science and Information Engineering, National Central University, Taiwan
c Institute of Bioinformatics, National Chiao-Tung University, Taiwan
d Department of Bioinformatics, Asia University, Taiwan

a r t i c l e i n f o
Keywords:
Expert system
Data mining
Gene expression
Bioinformatics
0957-4174/$ - see front matter � 2009 Elsevier Ltd. A
doi:10.1016/j.eswa.2009.07.053

* Corresponding author. Address: Department of Bi
Taiwan.

E-mail address: horng@db.csie.ncu.edu.tw (J.-T. Ho
a b s t r a c t

Motivation: The analysis of time series gene expression data can provide us with the opportunity to find
co-regulated genes that show a similar expression patterns under a contiguous subset of experimental
conditions. However, these co-regulated genes may behave almost independently under other condi-
tions. Furthermore, the similarity in the expression pattern might be time-shifted. In that case, we need
to be concerned with grouping genes that share similar expression patterns under a contiguous subset of
conditions and where the similarity in expression pattern might have time lags. In addition, to be consid-
ered a time-shifted similar pattern, because co-regulated genes in a biological process may show a peri-
odic pattern in their cell cycle expression, we also should group genes with periodic similar patterns over
multiple cell cycles. If this is carried out, we can regard such grouped genes as cell-cycle regulated genes.
Results: We propose a method that follows the q-cluster concept [Ji, L., & Tan, K. L. (2005). Identifying
time-lagged gene clusters using gene expression data. Bioinformatics, 21(4), 509–516] and further
advances this approach towards the identification of cell-cycle regulated genes using cell cycle micro-
array data. We used our method to cluster a microarray time series of yeast genes to assess the statisti-
cally biological significance of the obtained clusters we used the p-value obtained from the
hypergeometric distribution. We found that several clusters provided findings suggesting a TF–target
relationship. In order to test whether our method could group related genes that other methods have
found difficult to group, we compared our method with other measures such as Spearman Rank Correla-
tion and Pearson Correlation. The results of the comparison demonstrate that our method indeed could
group known related genes that these measures regard as having only a weak association.

� 2009 Elsevier Ltd. All rights reserved.
1. Introduction

DNA microarray technology enables the simultaneous study of
gene expression levels on a large scale. Expression level is the log-
arithm of the abundance of the mRNA of a gene under a specific
condition. The gene expression data of a microarray is arranged
as a data matrix. Each gene corresponds to one row and each
condition to one column. Each element of this matrix represents
the expression level of a gene under a specific condition. The
conditions of a microarray may be different time points, different
environmental conditions or different organs. The analysis of
microarray data has facilitated the study of genetic regulatory
networks. The correlation patterns of genes with experimental
conditions can be used to identify the networks that are comprised
ll rights reserved.

oinformatics, Asia University,

rng).
of correlated genes and thus how the correlated genes interact
with each other.

Clustering methods can be applied to either the genes or the
conditions of the microarray matrix separately. However, some
problems occur when applying clustering to the analysis of gene
expression data. A set of genes may simultaneously activate a par-
ticular biological process over certain contiguous conditions but
behave independently under other condition. Therefore, we need
to group genes that have similar behavior under a specific subset
of the conditions. Clustering can not satisfy this requirement. The
biclustering method is a technique that makes this possible and al-
lows the grouping of genes and conditions simultaneously within a
data matrix. The goal of biclustering is to find a bicluster that is a
subset of genes that show similar behavior under a specific subset
of the conditions (Cheng & Church, 2000). Thus, genes in the same
bicluster are co-expressed and further are likely to be co-regulated.

In the analysis of time-series expression data, a set of genes may
activates a particular biological process over a certain contiguous

http://dx.doi.org/10.1016/j.eswa.2009.07.053
mailto:horng@db.csie.ncu.edu.tw
http://www.sciencedirect.com/science/journal/09574174
http://www.elsevier.com/locate/eswa


L.-C. Wu et al. / Expert Systems with Applications 37 (2010) 2202–2213 2203
set of conditions instead under a discrete condition. In such a case,
we should find biclusters for a contiguous subset of conditions.
However, in fact, co-expressed genes do not regulate each other
simultaneously but only after a certain time lag. That is to say,
there is a transcriptional time lag whereby the regulator gene takes
time to express its protein product and a further delay occurs as
the target gene responds to the regulator protein. Hence, because
of the transcriptional time lag of co-regulated genes, we need to
take time-lagged co-regulated genes into consideration when
forming biclusters. In addition, when considering time-lagged sim-
ilarity of expression patterns between genes, it is necessary to con-
sider biclusters with coherent values for both the rows and
columns of the expression matrix. This is because their expression
relies on a promoter that is a structural regulatory sequence recog-
nized by a TF of the RNA polymerase holoenzyme. The reason that
co-expressed genes share a common sequence within their pro-
moter will therefore result in shared expression. However, the rec-
ognition efficiency of this TF is not the same for every gene having
the same promoter. This condition leads to biclusters having a vari-
ety of coherence values. There are two kinds of coherence values
for a bicluster. The first is a shifted similarity pattern that can be
viewed as based on an additive model. The second is a scaled sim-
ilarity pattern that can be viewed as based on a multiplicative
model. In a mathematical sense, scaling and vertical shifting of
the expression level can be referred to as linear transformations.
Consider two time-series x and y. In this case y is a linear transfor-
mation of x if it can be expressed as y ¼ mx þ b.

Many biological processes show periodic pattern such as the
cell cycle process, therefore it is useful to find periodically regu-
lated genes with similar periodic patterns of expression. We can
then use these cell-cycle regulated genes to map the transcrip-
tional regulatory network that controls the cell cycle.

A suffix tree is a data structure that contains all suffixes of a
string s. It has been widely used for string matching and exact se-
quence comparison (Ukkonen, 1995). This approach was used to
develop an algorithm for building a suffix tree that runs in time
OðnÞ. Once a suffix tree is built, most problems can be solved in lin-
ear-time using it. The suffix tree built for a set of strings is called a
generalized suffix tree (Gusfield, 1997). In order to avoid creating
empty suffixes, we usually append to s an extra character $ before
the building of the suffix tree. The key feature of the generalized
suffix tree is that any leaves in this tree contain two pieces of infor-
mation: The first is the string number and the second is the start-
ing position of a suffix that makes up this string.
2. Related work

A great many approaches have been developed for the identifi-
cation of co-regulated genes from microarrays. The correlation
method is one that determines whether two variables have a
strong global association, but this approach does not take time
lag issue into consideration. However, another correlation method,
the Cross-Correlation Method (Kato et al., 2001), is different from
the traditional Pearson Correlation approach. It takes into account
time-lagged co-regulations when testing the expression levels of
two genes and identifies if there is a significant linear relationship.
However, this approach still only determines whether two genes
have strong global similarity but does not determine local similar-
ity. Yet another approach is the edge detection method (Filkov
et al., 2001), which scans through each gene’s expression curve
to find where big changes in expression level (edges) occur, and
then sums up the number of edges making up the expression
curves of the two genes that have the same direction in order to
generate a similarity score. Gene pairs are likely to have a posi-
tive-regulated relationship and if this is true, they will be given a
high score. On the other hand, edges that are farther apart are gi-
ven a lower score. Although the edge method strongly focuses on
local similarity between two gene expression curves, there are al-
ways too few edges that match between gene pairs and this gives
rise to relatively low similarity scores. Yet another approach is the
Event Method (Kwon et al., 2003). This first transforms the direc-
tional changes in expression level into a directional event (Rising,
Constant or Falling) by calculating the slope values at each time
point; these are then converted into events. When the transforma-
tion process is complete, global sequence alignment by the Needle-
man–Wunsch algorithm is used to find the best match between the
two event strings taking noise and time lag into account. Gene
pairs that have an activation relationship are likely to have a high
similarity score for the event strings. Although the event method is
efficient at finding gene TF–target relationships, it is unable to pro-
duce a high score for related genes when the event strings are sim-
ilar for periodical short matches because the method uses global
alignment. Moreover, it is computationally inefficient because it
needs n2 pairwise global alignments if there are n genes. There is
one further limitation: this is that it provides putative TF–target
relationships for users by using gene pairs instead of gene clusters.
It is clear that clusters are more efficient than pairs when finding
TF–target relationships. The next available approach is CLARITY
(Balasubramaniyan et al., 2005). This can find locally similar re-
gions in gene expression profiles by measuring similarity based
on Spearman rank correlation. In order to discovering local time-
shifted relationships between two profiles, the program enumer-
ates all possible alignments in a systematic way. In order to reduce
the complexity of computation for Oðn2Þ, it use an approximate
algorithm. Although CLARITY tests similarity between genes by
measuring shape (the qualitative behavior) of the expression pro-
files, which is very useful, there are a few limitations to this pro-
gram. One exception is where cell cycle co-expressed genes are
highly related to the cell cycle but their qualitative behavior differ-
ent over multiple cycles. The result is a horizontal shift and scale
problem with the expression profiles and CLARITY finds it difficult
to identify such genes as having a strong association. A further ap-
proach is q-cluster (Ji & Tan, 2005). Here, the profile is transformed
into three type of change denoted by 1, �1 and 0. In this method,
the pattern of a q-cluster is indicated as an event string of length
ðq � 1Þ to show the changes that occur and this reflects how
expression level changes from condition i to condition i þ 1 under
the q conditions. As the data is transformed into three distinct clas-
ses, there are 3q�1 q-clusters in total and each q-cluster has a un-
ique q-value, where 0 6 q-value 6 3q�1 (Ji & Tan, 2005). Although
q-cluster provides users with detailed information that allows
the detection of periodically co-regulated genes, users must search
for these genes themselves. That is to say, q-cluster does not fur-
ther group genes together according the periodic similar patterns
relative to the cell cycle expression data.

Spellman et al. (1998) first performed a genome-wide transcrip-
tional analysis of the mitotic cell cycle of Saccharomyces cerevisiae
using microarrays and found about 800 periodically expressed
genes that peak each cycle. We can regard these genes as cell-cycle
regulated genes. In another similar investigation (Whitfield et al.,
2002) it proved possible to identify genes periodically expressed
over the human cell cycle using microarray analysis. Although a
large number of studies (Cho et al., 1998; Spellman et al., 1998)
have revealed over 800 genes that are cell-cycle regulated in yeast,
there has been a lacking of a good method to further group these
periodically expressed genes into clusters by biclustering in order
to get an understanding of transcription regulatory networks with-
in cell cycle.

When biclustering (Cheng & Church, 2000) was first applied to
gene expression data, it used the measure, mean squared residue,
to find the biclusters. In later years, many approaches to


2204 L.-C. Wu et al. / Expert Systems with Applications 37 (2010) 2202–2213
biclustering of expression data have been proposed. However, if
these approaches focus on finding exclusive biclusters in a time
series of expression data, there are a number of problems that
are of concern. A major problem is that they ignore many addi-
tional relationships across the time series such as time-lagged
relationships between transcription factors and thus genes that
are activated by this transcription factor. Another problem is that
the process groups genes that show similar activity patterns un-
der a subset of discrete conditions instead of under contiguous
conditions. In the analysis of time series dataset, it is reasonable
to restrict the analysis to biclusters with contiguous columns to
reduce the time complexity of computation.
3. Methods

In order to take into account the bicluster’s coherent values
when biclustering, we firstly transform the expression data into
event strings. We then use the set of event strings to construct a
generalized tree. When the generalized tree has been built, we
can use it to form biclusters quickly while taking time lags into
consideration. Furthermore, during the analysis of cell cycle
expression data, we need to focus on finding cell-cycle regulated
genes. Therefore, we further transform the event strings into trans-
actions representing what event substrings (items) are included in
genes’ event strings (transactions). Next, we use Apriori’s concept
to obtain the set of similar patterns (items) that occur simulta-
neously with the same genes (transactions). Finally, we use the
positions at which the similar patterns occur for the same genes
to further group the genes taking periodically similar patterns
and time lags into consideration.
if a00i;j�1 ¼ U;a
00
i;j ¼ N;Ai;j > s; a

00
i;jþ1 ¼ N;Ai;jþ1 > s; . . . ; a

00
i;jþk ¼ N;Ai;jþk > s; Ai;jþkþ1 > s; a

00
i;jþkþ1 ¼ D;

then a00i;j ¼ H;a
00
i;jþ1 ¼ H; . . . ; a

00
i;jþk ¼ H;

if a00i;j�1 ¼ D;a
00
i;j ¼ N;Ai;j < �s; a

00
i;jþ1 ¼ N;Ai;jþ1 < �s; . . . ; a

00
i;jþk ¼ N;Ai;jþk < �s; Ai;jþkþ1 < �s; a

00
i;jþkþ1 ¼ U;

then a00i;j ¼ L;a
00
i;jþ1 ¼ L; . . . ; a

00
i;jþk ¼ L:

8>>>><
>>>>:
3.1. Phase 1: Transforming expression data into event strings

When we have n genes and m conditions in the form of time-
series expression data, the time-series expression data can be
represented as a A ¼ n � m matrix, where Ai;j represents the
Table 1
The original expression matrix.

C1 C2 C3 C4 C5 C6

G1 �0.42 �0.01 0.19 �0.09 �0.10 �0.68
G2 �0.43 0.15 0.3 0.05 0.07 �0.02
G3 �0.11 0.27 0.38 0.46 �0.06 �0.29
G4 �0.27 �0.13 0.2 0.17 0.09 0.14
C5 �0.25 �0.28 �0.35 0.82 0.58 0.02

Table 2
The matrix showing the changes from the original matrix.

C1,2 C2,3 C3,4 C4,5 C5,6 C

G1 0.98 20 �1.47 �0.11 �5.8
G2 1.35 1 �0.83 0.4 �1.29
G3 3.45 0.41 0.21 �1.13 �3.83
G4 0.52 2.54 �0.15 �0.47 0.56
G5 �0.12 �0.25 3.34 �0.29 �0.96 �
expression level of gene i under condition j. First, Matrix A is con-
verted into a A0 ¼ n �ðm � 1Þ matrix such that

A0i;j ¼

Ai;jþ1�Ai;j
jAi;jj

if Ai;j–0;

Infinity if Ai;j ¼ 0 and Ai;jþ1 > 0;
�Infinity if Ai;j ¼ 0 and Ai;jþ1 < 0;
0 if Ai;j ¼ 0 and Ai;jþ1 ¼ 0:

8>>>><
>>>>:

to reflect the change of each gene expression level between two
neighboring time points. The transformation function we used
comes from q-cluster. Each entry A0i;j reveals the directional change
(the rate of change across time) from the expression level of gene i
at time point j to the expression level of gene i at time point j þ 1.
After A has transformed into A0 matrix, we are interested in how
the change in the gene expression level can be converted into a
set of symbols, R. R contains five symbols, fD; U; N; H; Lg meaning
down-regulated, up-regulated, unchanged, unchanged with high
expression, and unchanged low expression, respectively. We use a
threshold t (same with q-cluster) to transform the changes into
event symbols such that

A00i;j ¼
U if A0i;j P t;

D if A0i;j 6�t;
N otherwise:

8><
>:

Further, we also use a threshold s to transform the symbol, N
into H or L to describe what happens with the expression profiles
in more depth. For every sub-matrix a00 ¼ 1 � k of A00 whose start-
ing row is i and starting column is j, where i 2 1 . . . :n; j 2 1 . . .
m � 1; j þ k 6 m � 1, and
After the conversion phase, matrix A0 is transformed into matrix A00

and A00i;j represents the converted symbol of the change from the
expression level of gene iat time point j to the expression level of
gene i at time point j þ 1. As an example, Tables 1–3 show the process
of converting the gene expression levels. The original matrix A is
shown in Table 1. In Table 2, the matrix A0 reflects the changes in
C7 C8 C9 C10 C11 C12

�0.16 0.33 0.29 0.11 0.35 0.21
0.33 0.26 �0.41 0 �0.4 0.03
�0.49 �0.11 0.03 �0.16 0.43 �0.03
�0.34 0.1 0.23 �0.34 0.13 0.19
�0.51 �0.22 �0.4 �0.22 1.07 0.11

6,7 C7,8 C8,9 C9,10 C10,11 C11,12

0.76 3.06 �0.12 �0.62 2.18 �0.4
17.5 �0.21 �2.58 1 �1 1.08
�0.69 0.78 1.27 �6.33 3.69 �1.07
�3.43 1.29 1.3 �2.48 1.38 0.46
26.5 0.57 �0.82 0.45 5.86 �0.9


Table 3
The result of conversion of the original matrix. Some event symbols N are have transformed into H to describe what is happening within the gene profiles in more depth.

C1,2 C2,3 C3,4 C4,5 C5,6 C6,7 C7,8 C8,9 C9,10 C10,11 C11,12

G1 U U D N D U U N ? H D U D
G2 U U D U D U N ? H D U D U
G3 U U N ? H D D D U U D U D
G4 U U N ? H D U D U U D U U
G5 N N U N ? H D D U D U U D

Table 5
The candidate 2-itemsets after pruning. G2[5, 3] and G5[2, 4] can not be counted as
one because their positions overlap. The items (similar patterns) in the itemset are
not limited to be different from each other, see itemset, UDU, UDU. The itemsets
{UUD, UUD}, {UHD, UHD} and {UHD, DDU} have a support of less than 2 and therefore

L.-C. Wu et al. / Expert Systems with Applications 37 (2010) 2202–2213 2205
gene expression levels on conversion from matrixA. Finally, the re-
sult of conversion process ðt ¼ 0:3; s ¼ 0:1Þ forms matrix A00, which
is shown in Table 3. In the latter case, some symbols N have now been
transformed into H, because, despite the expression not varying
greatly, the expression is still high and therefore important in itself.

3.2. Phase 2: Using converted matrix to construct the generalized tree

Let the set of event strings S ¼fS1; S2; . . . ; Sng be obtained from
each row in converted matrix A00, which comes from the first phase.
Each of these strings has m � 1 event symbols and corresponds to
the symbols in a row of the converted matrix. To build a suffix tree,
add a special end of string character,, to each event string. Then,
use the set of event strings S to construct the generalized tree T.
After the generalized suffix tree T has been built, we can use it to
find the common subsequences shared among more than two
sequences. Based on such common subsequences, the similar
sequences (event strings) are grouped together. Use of a generali-
zed suffix tree allows a pattern of length n to be found in kstrings
in Oðn þ kÞ time. This is known as Exact String Matching. We can
use Exact String Matching to find co-regulated genes in a similar
way to q-cluster. As shown in Table 4, genes which have an arbi-
trarily similar pattern of length 3 can be grouped together. Like
q-cluster, users could then choose an interesting pattern that is re-
peated to pinpoint TF–target relationships. For example, G3 and G5
have a similar pattern, UUD, and G5 activates this pattern after G3
by two time lags. A TF–target relationship is likely between G3 and
G5. So far, our method is able to achieve the same objectives as q-
cluster, but our approach gives more. If users want to see other
lengths among similar patters, users just query through the gener-
alized tree T that has been built by transforming the event strings.
This is not true for q-cluster and the q-clusters need to be calcu-
lated again. Furthermore, although q-cluster provides users with
detailed information that helps the detection of periodically co-
regulated genes, users still have to search manually. That is to
say, q-cluster does not further group genes together according to
the periodic similar patterns for cell cycle expression data. In our
method, we take into account periodically similar patterns to
group periodically co-regulated genes as can be seen in the next
phase.

3.3. Phase 3: Identifying similar patterns (items) that occur
simultaneously among genes (transactions)

Let T ¼fT 1; T 2 . . . T ng be the set of transaction, where n is the
number of genes in expression data. A transaction can be thought
Table 4
Grouping genes that have similar activation patterns.

Pattern Gene [start position]

UUD G1[0], G2[0], G3[6], G4[6], G5[8]
UHD G1[6], G2[5], G3[1], G4[1], G5[2]
UDU G2[1], G2[3], G2[8], G3[7], G4[4], G4[7], G5[6]
� � � � � �
DDD G3[3]
DDU G3[4], G5[4]
DND G1[2]
of as a basket (gene) that contains a set of items (event substrings
of a fixed length), which are bought together. A generalized suffix
tree allows us to quickly find common subsequences that are
shared among gene event strings; therefore we can use it to built
transactions representing the event substrings that are included
in gene event strings. When we have a set of transactions, T, we
can use these to find the frequent h-itemsets (h similar patterns),
and count the number of times they appear simultaneously in
transactions (genes). The number of times that a transaction is
found is defined by us as the support S, which is different to the
Apriori approach (Agrawal and Srikant, 1994). There are several
steps involved in finding frequent h-itemsets that are similar to
the Apriori algorithm, but there are also some differences. First, find
all frequent 1-itemsets. Then, do the next steps for each iteration
until the frequent h-itemsets are found. The candidate k-itemsets
are next generated by merging frequent ðk � 1Þ-itemsets. As the
next step, prune the candidate k-itemsets whose subsets are infre-
quent. Take the candidate k-itemsets that survive the pruning pro-
cess and count the number of times they appear simultaneously
among transactions. If the k items (similar patterns) among every
k-itemsets do not overlap (similar pattern positions should not
overlap each other), then we count this as one. Furthermore, we
can limit the distance of items’ starting positions to be larger than
dand then count this as one. Finally, extract the frequent k-itemsets,
a subset of candidate k-itemsets, with counts that are no less than
the support S. For example, if we use the transactions shown in Ta-
ble 3, they can be transformed into frequent 2-itemsets and the re-
sult is shown in Table 5. It should be noted that in our method, the
items in every itemsets are not limited to being different from each
other. The reason that this is done is so we can take into account in-
stances when the periodic similar patterns among genes are the
same. If G2 and G4 are examined, it will be found that they have
the periodic similar pattern, UDU, which occur at position [3, 8] in
G2 and at position [4, 7] in G4. Although we have collected the
transactions (genes) that have period similar patterns, we still need
to further group these genes according to the periodically similar
patterns’ positions to find related transactions; this is carried out
in the next phase. If the number of grouped genes is less than
threshold S, it is clear that the number of grouped genes can never
greater than the threshold S. Thus, we can regard the threshold S as
the final frequent 2-itemsets are {UHD, DDD}, {UHD, UDU}, {UDU, UDU}, etc.

Patterns Gene [start positions] Support

UUD, UUD 0
UUD, UHD G1[0, 6], G2[0, 5], G3[6, 1], G4[6, 1], G5[8, 2] 5
� � � � � �, � � �
UHD, UHD 0
UHD, UDU G2[5, 1], G2[5, 3], G2[5, 8], G3[1, 7], G4[1, 4], G5[2, 6] 5
� � � � � � � � �
UHD, DDU G3[1, 4] G5[2, 4] 1
UDU, UDU G2[3, 8], G4[4, 7] 2
� � � � � � � � �


2206 L.-C. Wu et al. / Expert Systems with Applications 37 (2010) 2202–2213
a first cutoff that is used to limit the size of the grouped genes and
avoid unnecessary computations. The detailed pseudo code is
below:

Input: The number of similar patterns,h.
Output: frequent h-itemsets.
Scan the transactions to find frequent h-itemsets

1: Scan the transactions t 2 T to find frequent 1-itemsets;

2: for ðk ¼ 2; k <¼ h; k þþÞ{
3: Generate candidate k-itemsets; // By merging a frequent
ðk � 1Þ-itemsets.

4: Prune candidate k-itemsets whose subsets are infrequent;
5: Scan the transactions t 2 T to count the occurrences of

itemsets in candidate k-itemsets; // kitems’ starting position
do not overlap.

6: Extract the frequent k-itemsets, a subset of candidate k-
itemsets with counts no less than support S;

7: }
8: Return h-itemsets;
3.4. Phase 4: Further grouping of genes according to their similar
pattern positions

When we have obtained frequent h-itemsets from the above
phases, we need to further group the genes according the positions
at which these similar patterns (items) occur simultaneously for
the genes (transactions). Using Table 5 as an example, it is possible
to see how further clustering is carried out. We already know that
the itemset, UUD, UHD, in Table 5 occurs at positions, [0, 6] in G1,
[0, 5] in G2, [6, 1] in G3, [6, 1] in G4 and [8, 2] in G5. Suppose that a
node is a set of positions for itemset, UUD, UHD, among the same
gene (transaction). Hence, this set of nodes is {G1[0,6], G2[0, 5],
G3[6, 1],G4[6, 1],G5[8, 2]}. The edge is the Euclidian Distance be-
tween two nodes. For example, the distance between node,

G1[0, 6], and node, G2[0, 5] is
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ð0 � 0Þ2 þð6 � 5Þ2

q
¼ 1. We create

a graph that contains these nodes and computed the edges as
Fig. 1. The process of grouping. (a) The original graph has five node and 10 edges. The dis
has the minimal distance 0 within whole graph. We merge node G3 and G4, and then u
merged node and the other nodes. (c) We merge nodes, G1 and G2, and update this me
compute the distances between the merged node and the other nodes. (d) The final res
shown in Fig. 1a. When the graph is complete, the edge containing
the minimum distance is first selected. If the selected edge’s dis-
tance is less than the cutoff threshold d we set it to3 and the two
nodes that are connected by this edge are merged (see Fig. 1b).
The merged node’s position is re-defined as the average of the ori-
ginal two nodes’ positions (see Fig. 1b) and the distances along the
edges that the merged node is connected with are re-computed for
next merging. Next, we similarly select another edge that has the
minimal distance for the whole graph and then merge two nodes
if the minimal distance between them is less than the cutoff
threshold d. Of course, we then should update the merged nodes
and the edges that connect the new node with the other nodes.
This merging process is continued until we can no longer find a
minimal distance that is less than the cutoff threshold d. The final
result is shown in Fig. 1d. We find that we have indeed grouped
these genes into two groups, [G1, G2] and [G3, G4, G5] according
to the similar pattern positions. The pseudo code is presented
below:

Input: The frequent h-itemset and these transactions (genes)
that items (similar patterns) among this frequent h-itemset
occur simultaneously at.
Output: The result of grouping according the positions.
1: Build the initial graph;
2: Compute Euclid Distance between two nodes;
3: repeat
4: minDistance = Select the minimal distance;
5: if(minDistance < cutoff threshold dÞ {
6: Merge two nodes and then update merged node and dis-

tances between merged node and other nodes;
7: }else{
8: Break;
9: }

10: end
11: Return grouping result;
tances are already computed by Euclidian Distance. (b) The edge between G3 and G4
pdate the merged node’s positions. We also re-compute the distances between the
rged node’s positions by averaging the merging nodes’ positions and then also re-

ult of grouping according to the original positions is two clusters.


L.-C. Wu et al. / Expert Systems with Applications 37 (2010) 2202–2213 2207
3.5. Phase 5: Functional enrichment

Based on the hypothesis that co-expressed genes are likely to
have a similar function, we can use this to help analyze the large
number of clusters generated by our procedure. In order to show
that the clusters we generated have biological significance, we
use p-values that are obtained from the hypergeometric distribu-
tion. The probability (p-value) of observing at least m genes from
a specific class within a study cluster of size n is given by

p-value ¼ 1 �
Xm�1
i¼0

f

i

� �
�

N � f
n � i

� �

N

n

� �

where f is the total number of genes from the population within a
specific class and N is the total number of genes within the total
population (defined as all genes represented on the microarray).
Thus, the p-value corresponds to the probability of obtaining at
least m gene of the cluster in a random set of size n. A low p-value
indicates that the genes annotated belong to an enriched class that
is statistically biologically significant. Our system has been coupled
to Ontologizer (Robinson et al., 2004) to calculate p-values to detect
statistically significant enrichment within one or more GO catego-
ries. Furthermore, we also calculate p-value for the statistically sig-
nificant enrichment for some TFs’ known target-gene groups.
Moreover, we have modified some open source code from Ontolo-
gizer so that these subroutines are completely integrated into our
system.
4. Experimental results

4.1. Clustering with the aim of finding time-lagged co-regulated genes

In this experimental system, we applied Spellman’s yeast cell
cycle dataset that includes 6331 open reading frames. The full
dataset contains all the expression data for the alpha factor, cdc
15 and elutriation time course experiments. We used only the al-
pha factor dataset to validate our approach. We examined the re-
sults obtained with this dataset by our method for the detection
of time-lag co-regulated genes over the cell cycle. We applied the
approach to the cell-cycle regulated genes that were identified
by Spellman’s group (Spellman et al., 1998) in our experiments
by using the identified cell-cycle regulated genes as the population
of genes tested using our method. These identified cell-cycle regu-
lated genes involves 800 genes and include 14 well-known cell-cy-
cle TFs, SWI4, YOX1, SWI5, FKH1, RAP1, YHP1, HCM1, ACE2, STP2,
GAT3, PHD1, GCR1, TEC1 and MET28. Furthermore, we add some
other known cell-cycle TFs to the original dataset including
ARR1, MBP1, RPN4, YAP1, FKH2, MCM1, STE12, YAP5, STP1 making
Fig. 2. Fkh2 (black bold line) really regulates the genes displayed here from Cluster 1
a total of 809 genes. The Spellman’s yeast cell cycle dataset covers
two cell cycles and for this reason we formed clusters based on
whether the gene profiles had two similar patterns of length 3
units over two cell cycles. The threshold s was set to 0.1 and the
threshold t was set to 0.3. The distance d between two similar pat-
terns was set to 2. The threshold D was 5 to take into account time
lag. The support S for cluster size was set at 2 to avoid missing any
significant clusters.

Our method is designed to fit the special needs of users; hence,
users are able to pick out clusters of interest according to what
they think are significant similar patterns within these clusters.
For example, someone may believe that the patterns, UUH, UUD,
are significant whereas the patterns, DDD, UUU, are insignificant.
In order to filter some clusters that are insignificant, we can pick
up clusters whose similar patterns both start with the character,
U, and contain at least one TF. To realize the biological relevance
of these chosen clusters, the p-value is calculated from the hyper-
geometric distribution in order to model the probability of observ-
ing at least mgenes, from a cluster with n genes that may by chance
contain a TF that regulates f known (documented) genes from the
809 genes in the dataset. We use YEASTRACT (Teixeira et al., 2006)
to search for documented TF–target relationships for every cluster
and then calculate the p-value for each TF that is grouped with its
known regulated genes. We present only the top few clusters
(p-value < 0.01), which are summarized in Table 9 in the Appendix.
Although most of the clusters we have picked up do not show a
statistically significant p-value, the result is really quite good. If
Cluster 1 is examined, it is found to contain the TF, Fkh2, and this
TF would seem to regulate the genes found in Cluster 1. In order to
validate whether Fkh2 really regulates these Cluster 1 genes, we
plot Fkh2 and the genes that the result is coordinated regulation
(see Fig. 2). It can be seen that Fkh2 over-expresses simultaneously
with or before the genes in this cluster and this supports the idea
Fkh2 regulates these genes. Similarly, in chosen Cluster 2, Swi4
would seem to regulate the genes that form this cluster. In order
to confirm regulation they were also plotted in a similar way to
above (see Fig. 3). The Edge Detection and Event Method is one
method of finding a regulated relationship. However, there are
some drawbacks. Firstly, the result is regulator–target pairs instead
of clusters, but a TF does not regulate only one gene but a group of
genes with similar function; thus cluster results are more useful
than pairs. Secondly, the regulator–target pairs that are identified
by the Event Method only say there is a have high correlation
and do not provide more detailed information such as the exact
lag time.

We can also pick up clusters whose similar patterns both start
with the character U but do not contain any known TFs. Then,
we used YEASTRACT to obtain TFs that regulate some of the genes
from the chosen clusters. We calculated the p-value obtained from
the hypergeometric distribution to model the probability of
because it over-expresses simultaneously or before than known regulated genes.


Fig. 3. Swi4 (black bold line) really regulates the genes displayed here from Cluster 2 because it over-expresses simultaneously or before than known regulated genes.

2208 L.-C. Wu et al. / Expert Systems with Applications 37 (2010) 2202–2213
observing at least m genes from a chosen cluster with n genes by
chance if a TF (not be included in this chosen cluster) regulates f
known genes from the 808 genes in the dataset. We again present
only the top clusters (p-value < 0.01), which are summarized in Ta-
ble 9. The statistically most significant clusters (p-value < 0.01)
that contain a TF and also contain this TF’s regulated genes. It is
clear that if the percentage of a cluster’s genes that are regulated
by a TF is much more larger than the percentage of the population
genes that are regulated by the same TF, this cluster is more statis-
tically significant.

Table 10 in Appendix the results show that although the chosen
clusters do not contain any TFs, most of the genes among them
seem to be regulated by a TF. Moreover, based on the fact that
genes with a similar function are always regulated by same TF,
the results reveal that the clusters generated by our method have
some biological significance. We also plotted the TF for the known
regulated genes and the gene expression profiles of the cluster to
Fig. 5. Although Ace2 (black bold line) does not group with its known regulated genes
expresses simultaneously or before than known regulated genes.

Fig. 4. Although Swi5 (black bold line) does not group with its known regulated genes
expresses simultaneously or before than known regulated genes.
validate whether the group created by our method occurred by
chance or had real biological significance. If we plot the TF Swi5
and genes’ expression profiles from Cluster 21 (see Fig. 4), we find
that Swi5 is over-expressed simultaneously or before the regulated
genes and this supports the idea that Swi5 regulates these genes.
Similarly, the TF Ace2 seems to regulate the genes from Cluster
16 (see Fig. 5).
5. Comparison of other approaches

In order to test whether our method could group co-regulated
genes that other method find hard to group, we compared our
method with some correlation measures such as Spearman Rank
Correlation and Pearson Correlation. Initially, we picked two genes,
MCM7 and MCM4, which are components of the MCM complex
(Davey et al., 2003). Fig. 6 shows the expression levels of these
, we could show that it really regulated the genes displayed here because it over-

, we could show that it really regulated the genes displayed here because it over-


Fig. 6. The profiles of two genes, MCM7 and MCM4, which are components of the MCM complex.

Table 6
The pairwise similarity between MCM7 and MCM4 as calculated by Pearson
Correlation and Spearman Rank Correlation.

Pearson Correlation Spearman Rank Correlation

0.6 0.66

Table 7
The pairwise similarity matrix for CDC68, SWI4 and CLN1 as calculated by Pearson
Correlation.

CDC68 SWI4 CLN1

CDC68 0 0.68 0.69
SWI4 0 0 0.83
CLN1 0 0 0

Table 8
The pairwise similarity matrix for CDC68, SWI4 and CLN1 as calculated by Spearman
Rank Correlation.

CDC68 SWI4 CLN1

CDC68 0 0.70 0.70
SWI4 0 0 0.91
CLN1 0 0 0

L.-C. Wu et al. / Expert Systems with Applications 37 (2010) 2202–2213 2209
two genes and similar expression patterns that were detected by
our method. Then, we calculated the pairwise similarity between
the expression profiles of the individual genes according to Spear-
man Rank Correlation and Pearson Correlation (see Table 6). We
find that the pairwise similar scores for MCM7 and MCM4 are actu-
ally quite low (60.7). The reason for this is presence of scaled and
shifted similarity patterns between these co-regulated genes and
this causes these co-regulated genes to be difficult to group to-
gether by numerical measures. In contrast, our method transforms
the expression profiles into event strings to take into account scal-
ing and shift factors in the expression level between related genes.

We picked out three genes, CDC68, SWI4 and CLN1 that have a
regulated relationship between each other. CDC68 is a required
activator of SWI4 and SWI4 is a required activator of the G1 cyclin
genes CLN1 and CLN2. CDC68’s role at the CLN promoters may be
indirect (Lycan et al., 1994).

Fig. 7 presents the expression levels of these three genes and we
also calculated the pairwise similarity matrix between the expres-
sion profiles of these three genes according to Spearman Rank Cor-
relation and Pearson Correlation (see Tables 7 and 8, respectively).
In a similar way to the previous result, we found that the pairwise
similar scores are not high with half of scores being relatively low
(60.7). Again, Spearman Rank Correlation and Pearson Correlation
find it hard to group these genes using similarity matrixes. In real-
ity, there are some drawbacks to these two numerical measures.
Firstly, they do not provide detailed information on co-regulated
gene pairs including time lag between them and the pattern simi-
larity between two genes. The information on similar patterns that
we are interested in includes the starting time point and what hap-
pens over time to the patterns such as the change between two
neighborly time points.
Fig. 7. The profiles of three genes, CDC68, SWI4 and CLN1, w
6. Conclusion

By converting the gene expression values, we present a local
method to find potential TF–target relationships allowing the
detection of time-lagged gene clusters based on periodically sim-
ilar patterns over multiple cycles. Genes that have the same peri-
odic patterns are grouped together and these genes are likely to
be cell-cycle regulated genes that are controlled simultaneously
by a TF. Generally, these TFs are included in same group as the
regulated genes. The similarity of the two expression subprofiles
is according to the specific changes in expression level by the
genes. Our approach provides users with more detailed informa-
tion about the detected similar patterns and users can use this
information to choose the more significance clusters themselves
and thus decrease the number of candidate clusters.

We have experimented with our method on a time-series
expression dataset (Spellman et al., 1998). Genes with similar
hich have regulated relationships between each other.


2210 L.-C. Wu et al. / Expert Systems with Applications 37 (2010) 2202–2213
patterns are found to be co-expressed genes and further, they are
likely to have a TF–target relationship. In order to validate our
method, we used p-value to show that the clusters we generated
were able to help find TF–target relationships.

Our method was compared with some other measures such as
Pearson correlation and Spearman rank correlation and we found
that our method is able to group highly related genes together that
other measures found hard to do. Furthermore, unlike the other
methods, our approach can provide more detailed information on
Table 9
The statistically most significant clusters (p-value < 0.01) that contain a TF and also contain
regulated by a TF is much more larger than the percentage of the population genes that a

Cluster
number

Similar
patterns

Number of
genes (n)

Cluster with
this TF

Number of genes
be regulated by
this TF (m)

% o
gen
by

1 UUH, UUH 25 Swi4 18 72

2 UUN, UUU 25 Fkh2 12 60

3 UUU, UUU 77 Fkh2 16 33.

4 UHD, UUH 26 Swi4 14 53.

5 UHD, UUH 31 Swi4 15 50

6 UUH, UUN 10 Fkh2 6 60

7 UHD, UHD 33 Swi4 15 46.

8 UDD, UUU 18 Yhp1 5 31.
the TF–target relationships. Finally, the results from the Edge
Detection and Event Method are pairs of genes and the clusters ob-
tained by our method are clearly more useful than pairs for finding
a TF–target relationship.
Appendix A

See Tables 9 and 10.
this TF’s regulated genes. It is clear that if the percentage of a cluster’s genes that are
re regulated by the same TF, this cluster is more statistically significant.

f this cluster’s
es be regulated
this TF

% of population
genes be regulated
by this TF (f)

p-Value Genes are included in this
cluster are this TF’s known
regulated genes

15.9 2.74E�10 YPL127C, YJL187C
YMR199W, YPL267W
YPR202W, YER189W
YHL049C, YEL075C
YDR224C, GR109C
YBL003C, YHR218W
YPL163C, YER111C
YCR065W, YMR307W
YDR225W, YFL064C

10.1 3.51E�8 YNL068C, YJR092W
YPL141C, YNL058C
YJL051W,YPR119W
YML119W, YML034W
YMR032W, YLR190W
YGR108W,YHR152W

3 10.1 4.81E�6 YMR001C, YPL141C
YJL051W, YHR023W
YPR119W,YIL106W
YML034W, YNL068C
YJR092W, YKL096W
YPL242C, YNL058C
YPL155C, YLR131C
YPR156C, YHR152W

8 15.9 6.25E�6 YJL187C, YGR014W
YPL267W, YPR202W
YER189W ,YHL049C
YEL075C, YGR109C
YHR218W, YPL163C
YER111C, YCL025C
YCR065W, YFL064C

15.9 9.42E�6 YJL187C, YMR199W
YPL267W, YPR202W
YER189W, YHL049C
YEL075C, YOR074C
YGR109C, YHR218W
YPL163C, YPR120C
YER111C, YCR065W
YFL064C

10.1 1.37E�4 YNL068C, YJR092W
YPR119W, YML034W
YLR190W, YGR108W

8 15.9 2.57E�4 YJL187C, YPL267W
YPR202W, YER189W
YHL049C, YEL075C
YOR074C, YGR109C
YHR218W, YPL163C
YPR120C, YER111C
YCL025C, YCR065W
YFL064C

2 5.3 9.48E�4 YJL194W, YPR019W
YLR274W, YIL106W
YBR202W


Table 9 (continued)

Cluster
number

Similar
patterns

Number of
genes (n)

Cluster with
this TF

Number of genes
be regulated by
this TF (m)

% of this cluster’s
genes be regulated
by this TF

% of population
genes be regulated
by this TF (f)

p-Value Genes are included in this
cluster are this TF’s known
regulated genes

9 UDD, UDD 129 Met28 7 6.1 1.7 1.28E�3 YIR017C, YPR167C
YER091C, YJL078C
YGR055W, YFR030W
YGL184C

10 UDD, UUD 177 Tec1 7 4.6 1.7 7.64E�3 YNL283C, YDR055W
YNL166C, YER070W
YCR018C, YML100W
YBR083W

11 UHD, UUU 34 Fkh1 6 22.2 6.9 7.97E�3 YNL068C, YPR119W
YPL155C, YCL063W
YAR071W, YOR315W

Table 10
The statistically most significant clusters (p-value < 0.01) do not contain any TFs but in which most of the genes are regulated by a TF. It is clear that if the percentage of a cluster’s
genes that are regulated by a TF is much more larger than the percentage of the population genes that are regulated by the same TF, this cluster is more statistically significant.

Cluster
number

Similar
patterns

Number of
genes (n)

Cluster
without this
TF

Number of genes
be regulated by
this TF (m)

% of this cluster’s
genes be regulated
by this TF

% of population
genes be regulated
by this TF (f)

p-Value Genes are included in this
cluster are this TF’s known
regulated genes

1 UHD, UUU 62 Mbp1 18 38.2 12.1 1.71E�6 YDR309C, YOR230W
YGR014W, YNL339C
YIL177C, YHR219W
YHL049C, YDR113C
YDL018C, YKL096W
YMR179W, YJL074C
YHR218W, YHR153C
YLR462W, YML027W
YGR296W, YJL225C

2 UHH, UHH 7 Swi4 7 100 15.9 2.28E�6 YPL127C, YDR224C
YLR183C, YBL003C
YKR013W, YMR307W
YDR225W

3 UHH, UUH 13 Swi4 9 69.2 15.9 2.1E�5 YPL127C, YMR199W
YDR224C, YLR183C
YBL003C, YKR013W
YGR189C, YMR307W
YDR225W

4 UUH, UUU 53 Swi4 17 41.4 15.9 5.18E�5 YMR199W, YNL339C
YKL008C, YPR202W
YHR219W, YHL049C
YEL075C, YNL300W
YDL018C, YBL003C
YMR179W, YHR218W
YGL038C, YLR462W
YML027W, YGR296W
YMR307W

5 UDD, UHD 87 Mbp1 22 27.5 12.1 5.85E�5 YDR309C, YJL073W
YJL187C, YBL111C
YAR008W, YJR030C
YNL339C, YIL177C
YNL030W, YHR219W
YDL018C, YKL096W
YMR179W, YDR545W
YDL003W, YHR153C
YLL067C, YLR103C
YLR462W, YML027W
YGR296W, YJL225C

6 UUN, UUU 9 Fkh1 5 62.5 6.9 6.37E�5 YMR215W, YMR001C
YJL051W, YCL063W
YPR156C

7 UDD, UUH 51 Mbp1 16 32.6 12.1 7.47E�5 YJL187C, YAR008W
YNL339C, YNL030W
YHR219W, YOR075W
YDL018C, YMR179W
YDR545W, YDL003W
YHR153C, YLL067C
YLR103C, YLR462W

(continued on next page)

L.-C. Wu et al. / Expert Systems with Applications 37 (2010) 2202–2213 2211


Table 10 (continued)

Cluster
number

Similar
patterns

Number of
genes (n)

Cluster
without this
TF

Number of genes
be regulated by
this TF (m)

% of this cluster’s
genes be regulated
by this TF

% of population
genes be regulated
by this TF (f)

p-Value Genes are included in this
cluster are this TF’s known
regulated genes

YML027W, YGR296W

8 UDU, UHD 36 Gat3 7 19.4 3.3 8.86E�5 YCR041W, YEL076C
YHL049C, YEL075C
YLR463C, YFL065C
YFL064C

9 UDU, UUH 31 Gat3 6 20 3.3 2.18E�4 YLR467W, YEL076C
YHL049C, YEL075C
YLR463C, YFL064C

10 UUH, UUU 30 Swi4 13 43.3 15.9 2.55E�4 YPL127C, YJL187C
YDR507C, YGR086C
YIL141W, YKL113C
YPL256C, YBL003C
YDL055C, YMR307W
YER001W, YDR225W
YNL031C

11 UDD, UUN 11 Fkh2 6 54.5 10.1 2.77E�4 YPL141C, YNL058C
YJL051W, YML119W
YMR032W, YHR152W

12 UDL, UUU 9 Mcm1 4 66.6 8.8 7.57E�4 YJR092W, YPR119W
YGL021W, YLR131C

13 UUH, UUU 28 Fkh1 7 31.8 6.9 3.95E�4 YJR092W, YPR119W
YGL021W, YPL155C
YLR131C, YML034W
YAR071W

14 UDD, UUU 60 Swi5 11 21.1 7.1 6.01E�4 YDR055W, YKL164C
YPL283C, YGR086C
YBR083W, YLR464W
YLR467W, YGR041W
YKL163W, YNL328C
YLR463C

15 UDD, UUD 20 STP1 5 26.3 4 6.4E�4 YER070W, YJR154W
YDR501W, YML100W
YLR121C

16 UDN, UUH 22 Ace2 5 22.7 4 1.33E�3 YER124C, YNL327W
YKL185W, YHR143W
YLR079W

17 UUD, UUN 15 Yap5 4 28.5 5.9 7.04E�3 YIL129C, YPL141C
YNL058C, YIL158W

18 UDU, UUU 76 Gat3 7 12.5 3.3 1.56E�3 YLR467W, YHL049C
YEL075C, YLR462W
YLR463C, YBL112C
YPR203W

19 UDN, UHD 24 Ace2 5 20.8 4 2.02E�3 YER124C, YNL327W
YKL185W, YHR143W
YLR079W

20 UDN, UUD 13 Yhp1 4 30.7 5.3 3.49E�3 YJL115W, YNL339C
YLR413W, YLL067C

21 UDD, UUH 48 Swi5 9 19.1 7.1 4.22E�3 YLR464W, YLR467W
YKL164C, YLR049C
YPL283C, YLR463C
YJL078C, YGR086C
YPL158C

22 UDD, UDN 41 Gcr1 4 10.8 1.9 4.55E�3 YGR240C, YMR055C
YGR143W, YIL162W

23 UDD, UUU 39 Mcm1 8 22.8 8.8 8.77E�3 YMR001C, YLR040C
YJL157C, YJL194W
YNL058C, YLR274W
YGR143W, YHR152W

24 UDD, UDN 17 Rpn4 6 35.2 10.7 6.08E�3 YOL016C, YDR055W
YBL023C, YKR012C
YLR013W, YOR273C

25 UDD, UdN 39 Yap5 6 18.7 5.9 8.83E�3 YLL066C, YPL208W
YLL067C, YML050W
YBR007C, YPL283C

26 UHH, UUU 25 Mcm1 6 27.2 8.8 9.53E�3 YJR092W, YGL021W
YLR131C, YIL158W
YBR202W, YDR451C

2212 L.-C. Wu et al. / Expert Systems with Applications 37 (2010) 2202–2213


L.-C. Wu et al. / Expert Systems with Applications 37 (2010) 2202–2213 2213
References

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules, Proc.
of the 20th VLDB Conf., Santiago, Chile.

Balasubramaniyan, R., Hullermeier, E., et al. (2005). Clustering of gene expression
data using a local shape-based similarity measure. Bioinformatics, 21(7),
1069–1077.

Cheng, Y., & Church, G. M. (2000). Biclustering of expression data. In Proceedings of
the International Conference on Intelligent Systems of Molecular Biology (Vol. 8, pp.
93–103).

Cho, R. J., Campbell, M. J., et al. (1998). A genome-wide transcriptional analysis of
the mitotic cell cycle. Molecular Cell, 2(1), 65–73.

Davey, M., Indiani, C., et al. (2003). Reconstitution of the Mcm2-7p heterohexamer,
subunit arrangement, and ATP site architecture. Journal of Biological Chemistry,
278(7), 4491–4499.

Filkov, V., Skiena, S., et al. (2001). Identifying gene regulatory networks from
experimental data. In Proceedings of RECOMB.

Gusfield, D. (1997). Algorithms on strings, trees and sequences: Computer science and
computational biology. London: Cambridge University Press.

Ji, L., & Tan, K. L. (2005). Identifying time-lagged gene clusters using gene expression
data. Bioinformatics, 21(4), 509–516.
Kato, M., Tsunoda, T., et al. (2001). Lag analysis of genetic networks in the cell cycle
of budding yeast. Genome Informatics, 12, 266–267.

Kwon, A. T., Hoos, H. H., et al. (2003). Inference of transcriptional regulation
relationships from gene expression data. Bioinformatics, 19(8), 905–912.

Lycan, D., Mikesell, G., et al. (1994). Differential effects of Cdc68 on cell cycle-
regulated promoters in Saccharomyces cerevisiae. Molecular and Cellular Biology,
14(11), 7455–7465.

Robinson, P. N., Wollstein, A., et al. (2004). Ontologizing gene-expression
microarray data: Characterizing clusters with Gene Ontology. Bioinformatics,
20(6), 979–981.

Spellman, P. T., Sherlock, G., et al. (1998). Comprehensive identification of cell cycle-
regulated genes of the yeast Saccharomyces cerevisiae by microarray
hybridization. Molecular Biology of the Cell, 9(12), 3273–3297.

Teixeira, M. C., Monteiro, P., et al. (2006). The YEASTRACT database: A tool for the
analysis of transcription regulatory associations in Saccharomyces cerevisiae.
Nucleic Acids Research, 34(Database issue), D446–D451.

Ukkonen, E. (1995). On-line construction of suffix trees. Algorithmica, 14,
249–260.

Whitfield, M. L., Sherlock, G., et al. (2002). Identification of genes periodically
expressed in the human cell cycle and their expression in tumors. Molecular
Biology of the Cell, 13(6), 1977–2000.


	An expert system to identify co-regulated gene groups from time-lagged gene clusters using cell cycle expression data
	Introduction
	Related work
	Methods
	Phase 1: Transforming expression data into event strings
	Phase 2: Using converted matrix to construct the generalized tree
	Phase 3: Identifying similar patterns (items) that occur simultaneously among genes (transactions)
	Phase 4: Further grouping of genes according to their similar pattern positions
	Phase 5: Functional enrichment

	Experimental results
	Clustering with the aim of finding time-lagged co-regulated genes

	Comparison of other approaches
	Conclusion
	Appendix A
	References