key: cord-0932124-nfxcqsc2
authors: Kim, Kwangsoo; Ryoo, Hong Seo
title: A LAD-based method for selecting short oligo probes for genotyping applications
date: 2007-06-01
journal: OR Spectr
DOI: 10.1007/s00291-007-0089-0
sha: 6e112ecc22aed8c3ae72db210eae78989d204da4
doc_id: 932124
cord_uid: nfxcqsc2

Specializing a general framework of logical analysis of data for efficiently handling large-scale genomic data, we develop in this paper a probe design method for selecting short oligo probes for genotyping applications. When tested on genomic sequences obtained from the National Center of Biotechnology Information in various monospecific and polyspecific in silico experiments, the proposed probe design method was able to select a small number of oligo probes of length 7 or 8 nucleotides that perfectly classified all unseen testing sequences. These results demonstrate the efficacy of the proposed probe design method and illustrate the usefulness and potential a well-designed optimization-based probe selection method has in genotyping applications.

Between November 1, 2002 and July 31, 2003, Severe Acute Respiratory Syndrome (SARS) virus infected 8,096 people and proved fatal to 774 worldwide. 1 The avian influenza (AI) virus subtype H5N1 alone infected 152 people worldwide between 2003 and January 2006, and 83 died of the disease. 2 Luckily, none of the outbreaks of SARS and AI infections at the beginning of the new millennium brought about the worst-case scenario. Alarmingly, influenza experts seem to agree that another pandemic may be imminent (Webby and Webster 2003) and, as of this writing, a fearful AI H5N1 virus continues to spread in part of Asia and Europe.

A microarray or a DNA chip is a small glass or silica surface bearing DNA probes. Probes are single stranded reverse transcribed mRNAs, each located at a specific spot of the chip for hybridization with its Watson-Crick complementary sequence in a target to form the double helix (e.g., Schena 1999; Stears et al. 2003) . Microarrays currently use two forms of probes, namely, oligonucleotide (shortly, oligo) and cDNA, and have prevalently been used in the analysis of gene expression levels, which measures the amount of gene expression in a cell by observing hybridization of mRNA to different probes, each targeting a specific gene. With the ability to identify a specific target in a biological sample, microarrays are also well suited for detecting biological agents for genetic and chronic disease (e.g., Eom et al. 2004; Heller et al. 1997; Lee and Lee 2003; Liu et al. 2003) . Furthermore, as viral pathogens can be detected at the molecular and genomic level much before the onset of physical symptoms in a patient, the microarray technology can be used for an early detection of patients infected with viral pathogens (e.g., Sengupta et al. 2003; Vernet 2002; Wang et al. 2002; Zhou et al. 2005) .

The success of microarrays depends on the quality of probes that are tethered on the chip. Having an optimized set of probes is beneficial for two obvious reasons. One, the background hybridization is minimized; hence true gene expression levels can be more accurately determined (e.g., Li and Stormo 2001) . The other, as the number of oligos needed per gene is minimized, the cost of each microarray is minimized or the number of genes on each chip is increased, yielding oligo fingerprinting a much faster and more cost-efficient technique (e.g., Borneman et al. 2001; Li and Stormo 2001) . Short probes consisting of 15-25 nucleotides (nt) are used in genotyping applications (e.g., Stears et al. 2003) . Having short optimal probes means a high genotyping accuracy in terms of both sensitivity and specificity (e.g., Li and Stormo 2001; Sengupta et al. 2003) , hence can play a key role in genotyping applications. For example, in a pandemic, an effective method for selecting short optimal probes may be used in the mass production of a cost-efficient device for screening for the disease in suspected or susceptible hosts. Reverse genetics would be the most rapid means by which to produce an antigenically matched vaccine in a pandemic (Webby and Webster 2003) . An effective probe selection methodology can identify conserved regions of a viral family and hence may prove useful in the preparation of a vaccine via reverse genetics. Furthermore, the methodology can promote the availability of affordable home testing kits for accurate and confidential diagnosis of genetic and infectious disease and allow advanced and adequate medical treatment planning for patients.

A well-studied problem in machine learning and data mining deals with the discovery of a classification rule for different types of data. The probe design, say, for genotyping applications, can be roughly stated as selecting oligo probes for detecting a specific disease-agent in genomic sequences, hence falls into the realm of classical classification. Thus far, this interesting problem at the intersection of molecular biology and optimization has received relatively little attention from the optimization community, and systematic oligo design methods proposed so far are based on a simple greedy procedure (Herwig et al. 2000) , the set covering-based classification methodology (Borneman et al. 2001) , support vector machines (Lee and Lee 2003) , an evolutionary algorithm (Eom et al. 2004; Lee et al. 2004) , and mixed integer and linear programming (Klau et al. 2004) , briefly summarizing.

From the perspective of numerical optimization, genomic data present an unprecedented challenge for supervised learning approaches for a number of reasons. To name a few, first, genomic data are long sequences over the nucleic acid alphabet = {A,C,G,T}. Second, for example, the complexity of viral flora, owing to constantly evolving viral serotypes, requires a supervised learning theory to be trained on a large collection of target and non-target samples. That is, a typical training set contains a large number of large-scale samples. Furthermore, a supervised learning framework usually requires a systematic pairing or differencing between each target and non-target samples during the course of training a decision rule (e.g., Borneman et al. 2001; Boros et al. 2000; Klau et al. 2004; Rahmann 2003) . Owing to these and the nature of general data analysis and classification (Megiddo 1988 ), a supervised learning approach to classification of genomic data without specialized features for efficiently handling large-scale data is confronted by a formidable challenge.

Based on a general framework of logical analysis of data (LAD) from Ryoo and Jang (2005) , we develop in this paper a probe design method for selecting short oligo probes of length l nt, where l ∈ [6, 10]. To list some advantages of selecting oligo probes by the proposed method, first, the method selects probes via sequential solution of a small number of compact set covering (SC) instances, which offers a great advantage from computational point of view. To be more specific, consider classification of two types of data and suppose that a training set is comprised of m + target and m − non-target sequences. The size of the SC training instances solved by the proposed method is minimum of m + and m − orders of magnitude smaller than optimization learning models used in Borneman et al. (2001) and Klau et al. (2004) , for instance. Second, the method uses the sequence information only and selects probes via optimization based on principles of probability and statistics. That is, the probability of an l-mer (oligo of length l) appearing in a single sequence by chance is (0.25) l . Unless statistically significant, an l−mer appearing in multiple samples of one type and none or only a few of the sequences of the other type by chance is extremely small. Third, the proposed method does not rely on any extra tool, such as BLASTn (Altschul et al. 1990 ), a local sequence alignment search tool that is commonly used for probe selection (e.g., Sengupta et al. 2003; Wang et al. 2002; Wang and Seed 2003) , or the existence of pre-selected representative probes (e.g., Sengupta et al. 2003) . This makes the method truly stand-alone and free of problems that may possibly be caused by limitations associated with external factors. As mentioned earlier, the proposed probe design selects optimal probes via sequential solution of SC instances. Although SC is N P-complete (Garey and Johnson 1979) , its wide practical applications have invited an array of efficient (meta-)heuristic solution procedures to be developed. Therefore, last, the proposed method is readily implementable for efficient selection of oligo probes. This paper is organized as follows. In Sect. 2, we specialize a LAD framework from Ryoo and Jang (2005) for efficiently analyzing genomic sequences and develop an effective method for selecting short oligo probes. In Sect. 3, we test the proposed probe design algorithm in various in silico genotyping experiments using viral genomic sequences and report superb experimental results. To summarize, in all monospecific and polyspecific genotyping experiments on classification of viral pathogens using genomic sequences obtained from the National Center of Biotechnology Information website, the proposed probe design method selected a small number of probes of length 7 or 8 nt that perfectly classified all unseen testing sequences. Classifying the "noisy" human papillomavirus (HPV) sequences from the Los Alamos Laboratory by high and low risk types, the proposed probe design method selected optimal probes in a few CPU seconds that classified the testing sequences with 90.6% accuracy. For comparison, Eom et al. (2004) and Park et al. (2003) experimented with the same HPV dataset and reported the classification accuracy of 85.6 and 81.1%, respectively. These in silico results demonstrate efficacy and efficiency of the proposed oligo design method and further illustrate the usefulness and potential of a well-designed optimization-based probe design method in the forthcoming era of biotechnology. Finally, Sect. 4 concludes the paper with a few remarks.

Before proceeding, we refer interested readers to Schena (1999) , Stears et al. (2003) and Vernet (2002) for background in microarray analysis and its usage in the diagnosis of infectious disease. Furthermore, as classification of more than two types of data can be accomplished by sequential classification of two types of data (see, for example, Cortes and Vapnik 1995; Ullman 1973; Vapnik 1998 and Sect. 3), we present the material below in the context of the classification of + and − types of data for convenience and without loss of generality.

The backbone of the proposed procedure is LAD. LAD is a relatively new supervised learning methodology that is based on Boolean logic, combinatorics and optimization. A typical implementation of LAD analyzes data on hand via four sequential stages of data binarization, support feature selection, pattern generation and classification rule formation. As a Boolean logic-based, LAD first converts all non-binary data into equivalent binary observations. A + (−) "pattern" in LAD is defined as a conjunction of one or more binary attributes or their negations that distinguishes one or more + (−) type observations from all − (+) observations. The number of attributes used in a pattern is called the "degree" of the pattern. As seen from the definition, patterns hold the structural information hidden in data. After patterns are generated, they are aggregated into a partially defined Boolean discriminant function/rule to generalize the discovered knowledge to classify new observations.

Referring readers to Boros et al. (2000) , Hammer (1986) and Ryoo and Jang (2005) for more background in LAD, we design a LAD-based method below for efficiently handling and analyzing large-scale genomic data and selecting optimal oligo probes for genotyping applications.

Let there be m + and m − sample observations of type + (target) and − (non-target), respectively. For • ∈ {+, −}, let us use• to denote the complementary element of • with respect to the set {+, −}. Let S • denote the index set of m • sample sequences for • ∈ {+, −}.

A DNA sequence is a sequence of nucleic acids A, C, G and T, and the training sequences need to be converted into Boolean sequences of 0 and 1 before LAD can be applied. Toward this end, we first choose an integer value for l, usually l ∈ [6, 10] (see Sect. 3), generate all 4 l possible l-mers over the four nucleic acid letters and then number them consecutively from 1 to 4 l by a mapping scheme. Next, each l-mer is selected in turn and every training sample is fingerprinted with the oligo for its presence or absence. That is, with oligo j, we scan each sequence p i , i ∈ S + ∪ S − , from the beginning of the sequence and shifting to the right by a base and stamp

After this, the oligos that appear in all or none of the training sequences can be deleted from further consideration. We re-number the surviving l-mers consecutively from 1 to n and replace the original training sequences described in the nucleic acid alphabets by their Boolean representations. Let N = {1, . . . , n}.

The data are now described by n attributes a j ∈ {0, 1}, j ∈ N . For observation p i , i ∈ S • , • ∈ {+, −}, let p i j denote the binary value the j-th attribute takes in this observation. Denote by l j the literal of binary attribute a j . Then, l j = a j (l j = a j ) instructs to take (negate) the value of a j in all sequences. A term t is a conjunction of literals. Given a term t, let N t ⊆ N denote the index of literals included in the term. Then, we have t = j∈N t l j . A • pattern is a term that satisfies t ( p i ) := l j =a j , j∈N t p i j l j =ā j , j∈N tp i j = 1 for at least one p i , i ∈ S • , and t ( p k ) = 0 for all p k , k ∈ S•. Note here that N t of a • pattern identifies probes that collectively distinguish one or more • sequences from the sequences of the other type.

To aid in presentation, let us temporarily introduce n additional features a n+ j , j ∈ N , and use a n+ j to negate a j . Let N = {1, . . . , 2n} and let us introduce a binary decision variable x j for a j , j ∈ N , to determine whether to include l j in a pattern. Ryoo and Jang (2005) formulated a compact mixed integer and linear programming (MILP) model below with respect to a reference sample p i , i ∈ S • , • ∈ {+, −}:

Proof First, via the first constraint of (MILP-2.i • ) and the definition of J i , we trivially have

for the reference observation p i , i ∈ S • . Next, the second set of hard constraints yields that at least one of p l j = 0 for j ∈ N t for each p l , l ∈ S•. This gives

for all p l , l ∈ S•, and completes the proof.

Lemma 1 shows that any feasible solution of (MILP-2.i • ) can be used to form a • pattern. Now, note that if y l = 0 for l ∈ S • \ {i} in the solution, then the • pattern P formed also distinguishes p l from the• observations. Therefore, with the objective of minimizing the sum of y l 's, the MILP model can be understood as a way to generate a • pattern that distinguishes (more or less) a maximum number of • observations from the• observations. As easily seen, the number of 1's in the (optimal) solution determines the degree of the pattern generated.

As demonstrated in Ryoo and Jang (2005) , this model efficiently generates patterns of all degree with equal ease, provided that the number of training samples used is moderate and that n is not a big number. Genomic data are large-scale in nature, however. Furthermore, owing to constantly evolving viral serotypes, the complexity of viral flora is high and this requires large numbers of target and non-target viral samples to be used for selecting optimal genotyping probes. Adding to these the difficulties associated with numerical solution of MILP in general, we see that (MILP-2.i • ) presents no practical way of selecting genotyping probes.

With the need to develop a more efficient pattern generation scheme, we select a reference sequence p i , i ∈ S • , • ∈ {+, −}, and set

for k ∈ S• and j ∈ N . Next, we set

for l ∈ S • and j ∈ N . Now, consider the set covering model

where c j ( j ∈ N ) are positive real numbers (refer to Remark 4).

Theorem 1 Let (x, y) denote a feasible solution of (SC • i ). Then,

forms a • LAD pattern.

Proof To show the result, we need to show that the conjunction of literals formed via (2) distinguishes at least one • observation from all• observations. Toward the end, recall that p ik = 1(0) indicates the presence (absence) and the absence (presence) of probe k in the reference sequence selected p i , i ∈ S • , and in p k , k ∈ S•, respectively. With the cover (x, y) of (SC • i ) on hand, let us subdivide the index set N t = { j ∈ N :

Observe now that

= 1 implies that exactly one of p i j and p k j equals 1 for p i and p k , k ∈ S•. The cover (x, y) of (SC • i ) by definition satisfies all constraints of (SC • i ), and the hard constraints of the problem in the first set of cover inequalities require that at least one x l in the cover is set to 1 among l ∈ N with a (i,k) l = 1 for all k ∈ S•. This in turn implies that at least one p kl for l ∈ N 1 t or p il for l ∈ N 0 t equals 0 for all p k , k ∈ S• and yields

Note that P generated on the solution (x, y) of (SC • i ) via (2) also satisfies P( p l ) = 1 for all l ∈ S • \ {i} with y l = 0. The following result is immediate.

is also formulated in reference to p i for some i ∈ S • and finds a cover that distinguishes most • observations from the• observations. Therefore, although not identical, (SC • i ) can be seen as an SC version of (MILP-2.i • ). Although smaller than the MILP model by only one constraint and one integer variable, (SC • i ) has a much simpler structure and is defined only in terms of 0-1 variables. In addition, owing to having a wide range of practical applications, SC has invited the development of an array of efficient (meta-)heuristic solution procedures (e.g. Caprara et al. 1999 and references therein) and any of these can be used for solving (SC • i ) (refer to Remark 1). From the computational point of view, therefore, (SC • i ) is much preferred over its MILP counterpart.

Note that (SC • i ) is defined by m + + m − − 1 cover inequalities and n + m • − 1 binary variables. Also, recall that n is large for genomic sequences and the analysis of viral sequences requires large numbers of target and non-target sequences, that is, m + and m − are also large numbers. To develop a more compact SC-based probe selection model, we select a reference sequence p i , i ∈ S • , • ∈ {+, −}, and set the values of a (i,k) j for k ∈ S• and j ∈ N via (1). Consider the following SC model

where c j 's are positive reals (again, refer to Remark 4).

Theorem 2 Let x denote a feasible solution of (SC-pg • i ). Then, P generated on x via (2) forms a • LAD pattern.

Proof Same as the proof for Theorem 1.

We immediately have the following result that can be used for efficiently identifying the • observations that are also distinguished from the• observations by the pattern generated on the solution of (SC-pg • i ). Lemma 3 With a feasible solution x of (SC-pg • i ), generate a • pattern P via (2). Then, P distinguishes every • sequence p l , l ∈ S • , with p lk = p ik for all k ∈ N t from the• observations, where N t = { j ∈ N : x j = 1}.

Note that (SC-pg • i ) can be considered as a relaxation of (SC • i ): to see this, project (SC • i ) onto the space of x. Generally speaking, therefore, a feasible solution of (SC • i ) has more x j 's set to 1 in it than in a feasible solution of (SC • i ) formulated on the same data, hence tends to generate a higher degree pattern that generally explains a difference between the target and non-target sequences. As more • observations are distinguished from the• observations at a time by a solution of (SC • i ), it is formulated and solved for a less number of times for generating a set of • patterns that collectively distinguish all • observations from the• data in a dataset under analysis (refer to the oligo selection procedure detailed below). On the other hand, (SC-pg • i ) generates per solution a lower degree pattern that explains the specific difference between the reference • observation and the• sequences and, hence, is formulated and solved for a more number of times for generating a set of • patterns. Overall, the two models select about the same number of probes. However, as (SC-pg • i ) is much smaller in size, hence, is more efficiently solved, and because a high specificity is desired in genotyping applications, we prefer (SC-pg • i ) for selecting genotyping oligo probes. Using (SC-pg • i ), we design one simple oligo probe selection procedure below, where P • denotes the set of • patterns generated so far.

The following is immediate.

Theorem 3 procedure SC-pg terminates finitely.

A few remarks are due now.

Remark 1 Simply put, the number of 1's in the covers generated via procedure SC-pg determines the number of probes to be used for a specific genotyping purpose. In other words, the quality of an SC solution determines the cost of genotyping applications.

SC is a well-known N P-complete problem (Garey and Johnson 1979) . Owing to having a wide range of practical applications (despite its simple structure), SC has invited an array of (meta-)heuristic solution procedures to be developed for its efficient heuristic solution (e.g., Caprara et al. 1999 and references therein) and any of these can be used for solving (SC-pg • i ). In fact, the genotyping accuracy is not affected at all as long as the covers found are near-optimal and "good enough" (see results in the following section) and this was the rationale behind our developing SC-based probe selection models in this paper: recall that probe selection is a large-scale combinatorial optimization problem in nature.

Furthermore, the efficiency of SC heuristic solution procedures allows one to apply procedure SC-pg or the similar directly to the binarized data to generate patterns without going through the feature selection phase. This is another benefit the SC-based pattern generation offers over its MILP counterparts from Ryoo and Jang (2005) or the standard term-enumeration-based procedure for generating patterns in the LAD literature (e.g., Boros et al. 2000) .

i ) has all zero coefficients, the SC instance is infeasible. This case arises when the reference sequence p i , i ∈ S • , and the sequence p j , j ∈ S•, have identical 0-1 fingerprints, which is a contradiction. Supervised learning methodologies, including LAD, presume for the existence of a classification function that each unique sequence in the training set belongs to exactly one of the two classes. When this holds, contradiction-free 0-1 clones of the original data can always be obtained by using oligos of longer length for data binarization.

Remark 3 If desired, the hybridization affinity of probes can be ensured in a number of ways, including the following. First, during data binarization, one can remove from further consideration each l-mer with the GC content less than a prescribed level or with the melting temperature calculated via, for example, the formula found in Wang and Seed (2003) that falls outside a certain prescribed range from the median melting temperature of all l-mers generated. Next, the proposed LAD-based method can be applied to select an optimal set of probes on the surviving l-mers that are "compatible" in terms of their hybridization behavior.

i ) is a general-purpose model and can be specialized to select a minimal set of optimal oligo probes by any quantifiable probe selection criterion. For example, one may use the longest common factors from Rahmann (2003) or the OVL scores from Herwig et al. (2000) for c j values in (SG-pg • i ) to select probes by the (dis-)similarity preference. One may use, for example, the Shannon entropy scores from Herwig et al. (2000) for c j 's and incorporate the complexity of oligos in probe selection.

Denote by P + 1 , . . . , P + n + and P − 1 , . . . , P − n − the positive and negative patterns, respectively, generated via procedure SC-pg. In classifying unseen + (target) and − (non-target) sequences, we use three decision rules. First, in polyspecific genotyping applications (see, for example, Experiment 4 in Sect. 3.2), we form the standard LAD classification rule (Boros et al. 2000 )

where ω • i denotes the number of • training sequences covered by P • i and assign class + (−) to new sequence p if ( p) > 0 ( ( p) < 0). We fail to classify sequence p if ( p) = 0. For the monospecific genotyping, we use a strict classification rule. Specifically, for classification of two viral (sub-)types (see, for example, Experiment 1 in Sect. 3.2), we form a decision rule by

and assign p to class • if • ( p) > 0 while • ( p) = 0. When • ( p) > 0 and • ( p) > 0 or when • ( p) = 0 and • ( p) = 0, we fail in classifying the sequence. For the monospecific classification of more than two viral (sub-)types k = 1, . . . , m (see, for example, Experiment 7 in Sect. 3.2), we use the decision rule

where P k 1 , . . . , P k n k are the probe(s) selected to for virus (sub-)type k, and assign p to class k if k ( p) > 0 while i ( p) = 0 for all i = 1, . . . , m, i = k. When ( p) > 0 for more than two virus types or k = 0 for all k, then we fail to assign a class to sequence p.

In this section, we extensively test the proposed probe design for classification of viral disease-agents in in silico setting. To make these experiments as "realistic" as possible, we design each of these experiments based on information from the literature and the official website of the World Health Organization (WHO) and use viral genomic sequences obtained from the National Center for Biotechnology Information (NCBI) and human papillomavirus (HPV) sequences from the Los Alamos National Laboratory. To be more specific about the data used, we obtained the HPV data from the Los Alamos National Laboratory site for illustrative and comparative purposes. These data correspond to the 72 high and low risk HPV sequences that are used in Eom et al. (2004) and Park et al. (2003) . Although some of these manually classified virus sequences contain classification errors (Eom et al. 2004 ), we used the data with Table 1 , we provide the number and the length (the minimum, average ± 1 standard deviation and maximum length) of each type of the genomic data used in our experiments. In analyzing data in an experiment, we first decided on a length of oligos to use by calculating the smallest integer value l such that 4 l became larger than or equal to the average of the lengths of target and non-target sequences of the experiment. Then, 4 l candidate oligos were generated to fingerprint and binarize the data. If the length of oligos turned out to be not long enough during the pattern generation stage (see Remark 2), the data binarization stage was repeated with the value of l incremented by 1 and this process was repeated until the binary representations of the data became contradiction free. Next, procedure SC-pg was applied to generate patterns, hence, probes. In applying procedure SC-pg in these in silico experiments, we did not consider any oligo picking criterion that is non-theoretical in nature (refer to Remark 4) and selected a minimal set of oligo probes with using c j = 1 for all j ∈ N . For solving the unicost (SC-pg • i )'s generated, we used for ease of implementation the textbook heuristic procedure (e.g., Nemhauser and Wolsey 1988 ) that selects one variable at a time by the rule

where I j denotes the index of rows k with a (i,k) j = 1 and M u denotes the set of rows that are not yet covered by the partial cover x on hand.

In each of the experiments in this section, in order to fairly assess the classification capabilities of oligo probes selected by the proposed probe design procedure, we 1. randomly selected 90% of the target and of the non-target data to form a training set of sequences; 2. binarized the training data; 3. selected optimal oligo probes on the training data via procedure SC-pg; 4. formed a classification rule by one of (3), (4) and (5) with the selected oligo probes; 5. used the classification rule to (sub-)type each of the reserved testing sequences, consisting of the remaining 10% of the target and the non-target sequences; and 6. repeated steps above 20 times to obtain the average testing performance and other relevant information of the experiment.

The computational platform used for experiments was an Intel 2.66GHz Pentium Linux PC with 512Mb of memory.

Infection with HPV is the main cause of cervical cancer, the second most common cancer in women worldwide (Bosch et al. 2002; Muñoz et al. 2003) . There are more than 80 identified types of HPV and the genital HPV types are subdivided into high and low risk types: low risk HPV types are responsible for most common sexually transmitted viral infections while high risk HPV types are a crucial etiological factor for the development of cervical cancer (e.g., McFadden and Schumann 2001) .

We applied the proposed probe design method on the 72 HPV sequences downloaded from the Los Alamos National Laboratory with their classification found in Table 3 of Park et al. (2003) . The selected probes were used to form a decision rule by (3) and tested for their classification capability.

Results from this polyspecific probe selection experiment are provided in Table 2 . In this and the other tables in this section, the target (+) and the non-target (−) virus types of the experiments are specified in the first column. Then, the tables provide two bits of information on the candidate oligos, namely, the length l and the average and the standard deviation of the number of features generated and used in the 20 runs of each experiment for data binarization and for pattern generation: recall that we skip the feature selection stage of LAD (see Remark 1). Provided next in the tables is the information on the number of probes selected in the format "the average ± 1 standard deviation" and information on the LAD patterns generated. Finally, the testing performance of the probes selected is provided in the format "the average ± 1 standard deviation" of the percentage of the correct classifications of the unseen sequences. Briefly summarizing, the proposed probe design method selected probes on the HPV data in a few CPU seconds that tested 90.6% accurate in classifying unseen HPV samples. For comparison, the same HPV dataset was used in Eom et al. (2004) and Park et al. (2003) for the classification of HPV by high and low risk types. In brief, the probe design methods of Eom et al. (2004) and Park et al. (2003) required several CPU hours of computation and selected probes that obtained 85.6 and 81.1% correct classification rates, respectively.

Before moving on, we note that the sequences belonging to the target and the nontarget groups in this experiment all have different HPV subtypes (see Table 3 in Park et al. 2003) . The combination of all target and non-target sequences being different from one another and the presence of noise in the data (the classification errors) gave rise to selecting a relatively large number of polyspecific probes in this experiment.

The proposed probe design method was extensively tested on genomic viral sequences from NCBI for selecting monospecific and polyspecific probes for screening for SARS and AI in a number of different binary and multicategory experimental setting and performed superbly on all counts. We summarize the results from some of these experiments in this section.

Before proceeding, we briefly illustrate the benefit of probe selection via (SC-pg • i ) from the computational point of view with Experiment 5 below. For the purpose, let us first recall that probe selection is a combinatorial optimization problem. Therefore, for the selection of oligo probes for differentiating lethal AI virus H5 and H9 from the other AI virus H subtypes in Experiment 5, a supervised learning method based on a complete pairwise differencing of the target and non-target training sequences (e.g., Borneman et al. 2001; Boros et al. 2000; Klau et al. 2004; Rahmann 2003) a In format average ± standard deviation b Percentage of correct classifications of testing/unseen data would require solving one or more combinatorial optimization problems with between (148+93)×(137+660+77+65) = 226, 299 and 137×660×148×77×93×65 ≈ 6.23 × 10 12 rows (refer to Table 1 above for the numbers of the target and non-target viral sequences) and with at least 39, 056 0-1 decision variables (see Table 7 for the average number of l−mers generated in this experiment). For Experiment 5, we note in comparison that the largest (SC-pg • i ) instance generated and solved by procedure SC-pg had max{148 + 93, 137 + 660 + 77 + 65} = 939 rows and 39, 056 columns.

SARS virus is phylogenetically most closely related to group 2 coronavirus (Snijder et al. 2003) . 105 SARS sequences and 39 coronavirus samples were used to select 1 monospecific probe for screening for SARS. Used in a classification rule (4), the SARS probe and one probe selected for coronavirus together perfectly classified all testing sequences (see Table 3 ).

This experiment simulates a SARS pandemic where suspected patients with SARSlike symptoms are screened for the disease. We used the 105 SARS virus sequences and 107 samples of other influenza virus types (the "other virus" in Table 1 ) in this experiment and selected polyspecific probes. Used in a classification rule (3), these probes collectively gave the perfect classification of all testing sequences (see Table 4 ).

AI virus H7N7 is highly pathogenic with the capacity to pass from human-to-human, and this raised concerns for a possible viral reassortment with human influenza H1N1 and H3N2 strains during a large outbreak of H7N7 infection in the Netherlands in 2003 (Koopmans et al. 2004; Webby and Webster 2003) .

Based on information from Koopmans et al. (2004) , we replicated the classification of H7 and other influenza virus H subtypes in this experiment by using 77 H7 sequences and 1,103 other H subtype samples. Polyspecific probes were selected and tested in a classification rule (3) to give the perfect classification rate (see Table 5 ).

H5 and H7 have an ominous capacity to pass from human to human (http://www. who.int; Webby and Webster 2003) . This experiment, using 225 H5 and H7 viral samples and 955 other H subtype sequences, selected polyspecific probes for detecting the two pathogenic H subtypes of the AI virus from the other influenza virus H subtypes and vice versa. A classification rule was formed by (3) for testing the selected probes, and we obtained the perfect testing result (see Table 6 ).

Experiment 5 Classification of lethal AI virus H5 and H9 and other influenza virus H subtypes.

AI virus H5 and H9 subtypes cause a most fatal form of the disease (Koopmans et al. 2004) , and they were separated from the other H subtypes of influenza virus in this experiment. 241 H5 and H9 target sequences and 939 other H subtype sequences were used to select polyspecific probes for detecting AI virus H5 and H9 subtypes from the rest. In a classification rule (3), the selected probes collectively classified all testing sequences correctly (see Table 7 ).

Monospecific classification of SARS, human influenza H1, human influenza H3, AI H5 and AI H7 virus.

This multicategory classification experiment selects monospecific probes for distinguishing one from another a few notorious viral pathogens. We used 103 SARS virus, 137 human influenza virus H1, 660 human influenza virus H3, 148 lethal AI virus H5 and 77 pathogenic AI virus H7 sequences and selected monospecific probes for each virus type in sequential binary classification of "one type against the rest." The selected probes were tested in a classification rule (5) to classify the testing sequences p by a strict decision rule of "assign class i to p only if one or more probes selected for virus type i is found in p while none of the probes selected for the other types are not" and gave the perfect classification result (see Table 8 ; Note that only a small number of monospecific probes were selected, as in Experiment 1).

Monospecific Classification of N1, N2 and N3 influenza virus.

The statement "monospecific neuraminidase (NA) subtype probes were insufficiently diverse to allow confident NA subtype assignment" from Sengupta et al. (2003) motivated us to design this experiment on multicategory and monospecific classification of influenza virus by N subtypes. We used the three influenza virus N subtypes with 30 or more samples in Table 1 and selected monospecific probes for their classification. Tested in a classification rule (5), the selected probes performed perfectly in classifying all testing sequences (see Table 9 ; note again that only a small number of monospecific probes were selected and proved "needed" in this experiment, as in the other two monospecific genotyping experiments, Experiments 1 and 6).

The problem of probe design for hybridization-based experiments is an interesting problem lying at the intersection of molecular biology and optimization but has received relatively little attention from the OR community. In this paper, we specialized a general LAD framework from Ryoo and Jang (2005) for efficiently handling 3 ± 0 D e g r e e1 100 ± 0 N2 3.7 ± 0.5 Degree 1 only N3 1 ± 0 D e g r e e1

a In format average ± standard deviation b Percentage of correct classifications of testing/unseen data large-scale genomic data and developed a probe design method for selecting short oligo probes for genotyping applications. Extensively tested on genomic sequences obtained from the National Center of Biotechnology Information and the Los Alamos National Laboratory in various monospecific and polyspecific in silico experiments, the proposed probe design method was able to select a small number of oligo probes of length 7 or 8 nucleotides that performed superbly in classifying unseen testing sequences. These in silico results demonstrate the efficacy of the proposed oligo design method. Experimental results further illustrate a huge potential a well-designed optimization-based probe design method has in hybridization-based genotyping applications.

Collaborative research activities are planned to realize the in silico performance of the proposed probe design method on microarrays and in real hybridization experiments. Also, we plan to investigate the possibility of exploiting frequently used oligo selection criteria (e.g., Herwig et al. 2000; Lee et al. 2004; Li and Stormo 2001; Rahmann 2003) within the proposed probe design framework to further improve its effectiveness in terms of the number of probes needed.

Basic local alignment search tool

Probe selection algorithms with applications in the analysis of microbial communities

An implementation of logical analysis of data

The causal relation between human papillomavirus and cervical cancer

A heuristic method for the set covering problem

Support vector networks

Genetic mining of dna sequence structures for effective classification of the risk types of human papillomavirus (hpv)

New York Hammer PL (1986) Partially defined boolean functions and cause-effect relationships

Discovery and analysis of inflammatory disease-related genes using cdna microarrays

Information theoretical probe selection for hybridisation experiments

Optimal robust non-unique probe selection using integer linear programming

Transmission of h7n7 avian influenza a virus to human beings during a large outbreak in commercial poultry farms in the netherlands

Multi-objective evolutionary probe design based on thermodynamic criteria for hpv detection

Classification of multiple cancer types by multicategory support vector machines using gene expression data

Selection of optimal dna oligos for gene expression arrays

Possibility of using dna chip technology for diagnosis of human papillomavirus

The role of human papillomavirus in screening for cervical cancer

On the complexity of polyhedral separability

For the International Agency for Research on Cancer Multicenter Cervical Cancer Study Group Epidemiologic classification of human papillomavirus types associated with cervical cancer

Classification of the risk types of human papillomavirus by decision trees

Fast large scale oligonucleotide selection using the longest common factor approach

Milp approach to pattern generation in logical analysis of data

DNA microarray: a practical approach

Molecular detection and identification of influenza viruses by oligonucleotide microarray hybridization

Unique and conserved features of genome and proteome of sars-coronavirus, an early split-off from the coronavirus group 2 lineage

Trends in microarray analysis

Pattern recognition techniques. Crane, London Vapnik VN (1998) Statistical learning theory

Selection of oligonucleotide probes for protein coding sequences

Microarray-based detection and genotyping of viral pathogens

Are we ready for pandemic influenza?

The design and application of dna chips for early detection of sars-cov from clinical samples