key: cord-0036330-9rx4jlw8 authors: Kim, Kwangsoo; Ryoo, Hong Seo title: Selecting Genotyping Oligo Probes Via Logical Analysis of Data date: 2007 journal: Advances in Artificial Intelligence DOI: 10.1007/978-3-540-72665-4_8 sha: 89c19cc8be064cc6385f378206bec5a5fe7b6a5a doc_id: 36330 cord_uid: 9rx4jlw8 Based on the general framework of logical analysis of data, we develop a probe design method for selecting short oligo probes for genotyping applications in this paper. When extensively tested on genomic sequences downloaded from the Lost Alamos National Laboratory and the National Center of Biotechnology Information websites in various monospecific and polyspecific in silico experimental settings, the proposed probe design method selected a small number of oligo probes of length 7 or 8 nucleotides that perfectly classified all unseen testing sequences. These results well illustrate the utility of the proposed method in genotyping applications. A microarray or a DNA chip is a small glass or silica surface bearing DNA probes. Probes are single stranded reverse transcribed mRNAs, each located at a specific spot of the chip for hybridization with its Watson-Crick complementary sequence in a target to form the double helix [1] . Microarrays currently use two forms of probes, namely, oligonucleotide (shortly, oligo) and cDNA, and have prevalently been used in the analysis of gene expression levels, which measures the amount of gene expression in a cell by observing the hybridization of mRNA to different probes, each targeting a specific gene. With the ability to identify a specific target in a biological sample, microarrays are also well-suited for detecting biological agents for genetic and chronic disease [2, 3, 4, 5] . Furthermore, as viral pathogens can be detected at the molecular and genomic level much before the onset of physical symptoms in a patient, the microarray technology can be used for an early detection of patients infected with viral pathogens [6, 7, 8] . The success of microarrays depends on the quality of probes that are tethered on the chip. Having an optimized set of probes is beneficial for two reasons. One, the background hybridization is minimized, hence true gene expression levels can be more accurately determined [9] . The other, as the number of oligos needed per gene is minimized, the cost of each microarray is minimized or the number of genes on each chip is increased, yielding oligo fingerprinting a much faster and more cost-efficient technique [9, 10] . Short probes consisting of 15-25 nucleotides (nt) are used in genotyping applications [1] . Having short optimal probes means a high genotyping accuracy in terms of both sensitivity and specificity [6, 9] and can play a key role in genotyping applications. From the perspective of numerical optimization, genomic data present an unprecedented challenge for supervised learning approaches for a number of reasons. First, genomic data are long sequences over the nucleic acid alphabet Σ = {A,C,G,T}. Second, for example, the complexity of viral flora, owing to constantly evolving viral serotypes, requires a supervised learning theory to be trained on a large collection of target and non-target samples. That is, a typical training set contains a large number of large-scale samples. Third, a supervised learning framework usually requires a systematic pairing or differencing between each target and non-target samples during the course of training a decision rule [10, 11, 12, 13] . Adding to these, the nature of data classification is difficult [14] . Based on the general framework of logical analysis of data (LAD) from [15] , we develop in this paper a probe design method for selecting short oligo probes of length l nt, where l ∈ [6, 10] . To list some advantages of selecting oligo probes by the proposed method, first, the method selects probes via sequential solution of a small number of compact set covering (SC) instances, which offers a great advantage from computational point of view. To be more specific, consider classification of two types of data and suppose that a training set is comprised of m + target and m − non-target sequences. The size of the SC training instances solved by the proposed method is minimum of m + and m − orders of magnitude smaller than optimization learning models used in [10, 11, 12] . Second, the method uses the sequence information only and selects probes via optimization based on principles of probability and statistics. That is, the probability of an l−mer (oligo of length l) appearing in a single sequence by chance is (0.25) l , hence the probability of an l−mer appearing in multiple samples of one type but in none or only a few of the sequences of the other type by chance alone is extremely small. Third, the proposed method does not rely on any extra tool, such as BLASTn [16] , a local sequence alignment search tool that is commonly used for probe selection [6, 8, 17] , or the existence of pre-selected representative probes [6] . This makes the method truly stand-alone and free of problems that may possibly be caused by limitations associated with external factors. Last, with an array of efficient (meta-)heuristic solution procedures for SC, the proposed method is readily implementable for an efficient selection of oligo probes. As for the organization of this paper, we develop an effective method for selecting short oligo probes in Section 2 (for reasons of space, we omit proofs for the mathematical results in this section) and extensively test the proposed probe design method in various in silico genotyping experiments in Section 3 with using viral genomic sequences from the Los Alamos National Laboratory and the National Center of Biotechnology Information websites. The task of classifying more than two types of data can be accomplished by sequential classifications of two types of + and − data (see [18, 19, 20] and Section 3 below). Without loss of generality, therefore, we present the material below in the context of binary classification. The backbone of the proposed procedure is LAD. A typical implementation of LAD analyzes data on hand via four sequential stages of data binarization, support feature selection, pattern generation and classification rule formation. As a Boolean logic-based, LAD first converts all non-binary data into equivalent binary observations. A + (−) 'pattern' in LAD is defined as a conjunction of one or more binary attributes or their negations that distinguishes one or more + (−) type observations from all − (+) observations. The number of attributes used in a pattern is called the 'degree' of the pattern. As seen from the definition, patterns hold the structural information hidden in data. After patterns are generated, they are aggregated into a partially-defined Boolean discriminant function/rule to generalize the discovered knowledge to classify new observations. Referring readers to [13, 15, 21] for more background in LAD, we design a LAD-based method below for efficiently analyzing large-scale genomic data. Let there be m + and m − sample observations of type + (target) and − (nontarget), respectively. For A DNA sequence is a sequence of nucleic acids A, C, G and T, and the training sequences need to be converted into Boolean sequences of 0 and 1 before LAD can be applied. Toward this end, we first choose an integer value for l, usually l ∈ [6, 10] (see Section 3), generate all 4 l possible l−mers over the four nucleic acid letters and then number them consecutively from 1 to 4 l by a mapping scheme. Next, each l−mer is selected in turn and every training sample is fingerprinted with the oligo for its presence or absence. That is, with oligo j, we scan each sequence p i , i ∈ S + ∪ S − , from the beginning of the sequence and shifting to the right by a base and stamp p ij = 1, if oligo j is present in sequence i; and 0, otherwise. After this, the oligos that appear in all or none of the training sequences can be deleted from further consideration. We re-number the surviving l−mers consecutively from 1 to n and replace the original training sequences described in the nucleic acid alphabets by their Boolean representations. Let N = {1, . . . , n}. The data are now described by n attributes a j ∈ {0, 1}, j ∈ N . For observation p i , i ∈ S • , • ∈ {+, −}, let p ij denote the binary value the j−th attribute takes in this observation. Denote by l j the literal of binary attribute a j . Then, l j = a j (l j = a j ) instructs to take (negate) the value of a j in all sequences. A term t is a conjunction of literals. Given a term t, let N t ⊆ N denote the index of literals included in the term. Then, Note here that N t of a • pattern identifies probes that collectively distinguish one or more • sequences from the sequences of the other type. Let us introduce n additional features a n+j , j ∈ N , and use a n+j to negate a j . Let N = {1, . . . , 2n} and let us introduce a binary decision variable x j for a j , j ∈ N , to determine whether to include l j in a pattern. [15] formulated a compact mixed integer and linear programming (MILP) model below with respect to a reference sample Consider the following. We note here that genomic data are large-scale in nature. Furthermore, owing to constantly evolving viral serotypes, the complexity of viral flora is high, and this requires large numbers of target and non-target viral samples to be used for selecting optimal genotyping probes. Adding to these the difficulties associated with numerical solution of MILP, we see that (MILP-2.i • ) above presents no practical way of selecting genotyping probes. With the need to develop a more efficient pattern generation scheme, we select a reference sequence for k ∈ S• and j ∈ N. Next, we set for l ∈ S • and j ∈ N. Now, consider the set covering model where c j (j ∈ N ) are positive real numbers. Let (x, y) denote a feasible solution of (SC • i ). Then, forms a • LAD pattern. Although smaller than the MILP counterpart by only one constraint and one integer variable, (SC • i ) has a much simpler structure and is defined only in terms of 0-1 variables. In addition, it can exploit any of SC heuristic procedures developed so far (see, for example, [22] and references therein) for its efficient solution, hence is much preferred. Note that (SC • i ) is defined by m + + m − − 1 cover inequalities and n + m • − 1 binary variables. Also, recall that n is large for genomic sequences and the analysis of viral sequences requires large numbers of target and non-target sequences, that is, m + and m − are also large numbers. To develop a more compact SC-based probe selection model, we select a reference sequence p i , i ∈ S • , • ∈ {+, −}, and set the values of a (i,k) j for k ∈ S• and j ∈ N via (1). Consider the following SC model: where c j 's are positive reals. Theorem 2. Let x denote a feasible solution of (SC-pg • i ). Then, P generated on x via (2) forms a • LAD pattern. Below, we use (SC-pg • i ) to design one simple oligo probe selection procedure. Let P • denote the set of • patterns generated so far. In this section, we extensively test the proposed probe design for the classification of viral disease-agents in in silico setting with using genomic sequences obtained from the Los Alamos National Laboratory (LANL) and the National Center for Biotechnology Information (NCBI). Table 1 summarizes the number and the length (the minimum, average±1 standard deviation and maximum lengths) of each type of the genomic data that were used in our experiments. In analyzing data in an experiment, we first decided on a length of oligos to use by calculating the smallest integer value l such that 4 l became larger than or equal to the average of the lengths of target and non-target sequences of the experiment. Then, 4 l candidate oligos were generated to fingerprint and binarize the data. Note here that if a constraint in (SC-pg • i ) has all zero coefficients, then the SC instance has no feasible solution, and this case arises when the reference sequence p i , i ∈ S • , and the sequence p j , j ∈ S• have identical 0-1 fingerprints, which is a contradiction. Supervised learning methodologies, including LAD, presume for the existence of a classification function that each unique sequence in the training set belongs to exactly one of the two classes. When data under analysis are indeed contradiction-free, then contradiction-free 0-1 clones of the data can always be obtained by using oligos of longer length for data fingerprinting and binarization. Therefore, when we generated the identical fingerprint for data of different types, we incremented the value of l by 1 and repeated the data binarization stage until the binary representations of the data became contradiction free. Next, procedure SC-pg was applied to generate patterns, hence probes. In applying procedure SC-pg in these in silico experiments, we selected a minimal set of oligo probes by setting c j = 1 for all j ∈ N . For solving the unicost (SC-pg • i )'s generated, we used the textbook greedy heuristic [23] for ease of implementation. Denote by P + 1 , . . . , P + n+ and P − 1 , . . . , P − n− the positive and negative patterns, respectively, generated via procedure SC-pg. In classifying unseen + (target) and − (non-target) sequences, we use three decision rules. Specifically, for the polyspecific genotyping experiments (in Section 3.1 and Experiments 2 and 3 in Section 3.2), we form the standard LAD classification rule [13] Δ := where ω • i denotes the number of • training sequences covered by P • i . We assign class + (−) to new sequence p if Δ(p) > 0 (Δ(p) < 0). We fail to classify sequence p if Δ(p) = 0. For monospecific genotyping in Experiment 1 in Section 3.2, we form a decision rule by where P k 1 , . . . , P k n k are the probe(s) selected to for virus (sub-)type k, and assign p to class k if Δ k (p) > 0 while Δ i (p) = 0 for all i = 1, . . . , m, i = k. When Δ(p) > 0 for more than two virus types or Δ k (p) = 0 for all k, then we fail to assign a class to sequence p. In each of the experiments in this section, we tested the proposed oligo probe selection method in 20 independent hold-out experiments, each with randomly selected 90% of the target and of the non-target data forming a training set of sequences and the remaining 10 % of the target and of the non-target sequences forming the testing data. More specifically, after a training set of data was formed, we binarized the training data and selected optimal oligo probes on them via procedure SC-pg. Next, a classification rule was formed by one of (3), (4) and (5) above and then used for classifying the corresponding testing sequences. These steps were repeated 20 times to obtain the average testing performance and other relevant information of the experiment. The computational platform used for these experiments was an Intel 2.66GHz Pentium Linux PC with 512Mb of memory. (4, 5, 6, 7, 8, 9) 64 1341 1434±25 1467 The infection with HPV is the main cause of cervical cancer, the second most common cancer in women worldwide [24, 25] . There are more than 80 identified types of HPV and the genital HPV types are subdivided into high and low risk types: low risk HPV types are responsible for most common sexually transmitted viral infections while high risk HPV types are a crucial etiological factor for the development of cervical cancer [26] . We applied the proposed probed design method on the 72 HPV sequences downloaded from LANL with their classification found in Table 3 of [27] . The selected probes were used to form a decision rule by (3) and tested for their classification capability. Results from this polyspecific probe selection experiment are provided in Table 2 . In this table and also in the table found in the following subsection, the target (+) and the non-target (−) virus types of the experiments are first specified. Then, the tables provide two bits of information on the candidate oligos, namely, the length l and the average and the standard deviation of the number of features generated and used in the 20 runs of each experiment for data binarization and for pattern generation. Provided next in the tables is the information on the number of probes selected in the format 'the average ± 1 standard deviation' and information on the LAD patterns generated. Finally, the testing performance of the probes selected is provided in the last column of the tables, summarized in format 'the average ± 1 standard deviation' of the percentage of the correct classifications of the unseen sequences. Briefly summarizing, the proposed probe design method selected probes on the HPV data in a few CPU seconds that tested 90.6% accurate in classifying the unseen HPV samples. For comparison, the same HPV dataset was used in [2] and [27] for the classification of HPV by high and low risk types. In brief, the probe design methods of [2] and [27] required several CPU hours of computation and selected probes that obtained 85.6% and 81.1% correct classification rates, respectively. Before moving on, we note that the sequences belonging to the target and the non-target groups in this experiment all have different HPV subtypes (see Table 3 in [27] ). The combination of all target and non-target sequences being different from one another and the presence of noise in the data (the classification errors) gave rise to selecting a relatively large number of polyspecific probes in this experiment. The proposed probe design method was tested on genomic viral sequences from NCBI for selecting monospecific and polyspecific probes for screening for SARS and AI in a number of different binary and multicategory experimental setting and performed superbly on all counts. We describe individual experiments below and summarize results from these experiments in Table 3 . 1±0 d e g r e e1 * : in format average ± standard deviation † : percentage of correct classifications of testing/unseen data Experiment 1. SARS virus vs. coronavirus SARS virus is phylogenetically most closely related to group 2 coronavirus [28] . 105 SARS sequences and 39 coronavirus samples were used to select 1 monospecific probe for screening for SARS. Used in a classification rule (4), the SARS probe and one probe selected for coronavirus together perfectly classified all testing sequences. This experiment simulates a SARS pandemic where suspected patients with SARS-like symptoms are screened for the disease. We used the 105 SARS virus sequences and 108 samples of other influenza virus types (the 'other virus' in Table 1 ) in this experiment and selected polyspecific probes. Used in a classification rule (3), these probes collectively gave the perfect classification of all testing sequences. Experiment 3. Classification of lethal AI virus H5 & H9 and other influenza virus H subtypes AI virus H5 and H9 subtypes cause a most fatal form of the disease [29] , and they were separated from the other H subtypes of influenza virus in this experiment. 241 H5 and H9 target sequences and 1010 other H subtype sequences were used to select polyspecific probes for detecting AI virus H5 and H9 subtypes from the rest. In a classification rule (3), the selected probes collectively classified all testing sequences correctly. The statement "monospecific neuraminidase (NA) subtype probes were insufficiently divers to allow confident NA subtype assignment" from [6] motivated us to design this experiment on multicategory and monospecific classification of influenza virus by N subtypes. We used the three influenza virus N subtypes with 30 or more samples in Table 1 and selected monospecific probes for their classification. Tested in a classification rule (5), the selected probes performed perfectly in classifying all testing sequences. Note that only a small number of monospecific probes were selected and proved 'needed' in this experiment. Trends in microarray analysis Genetic mining of dna sequence structures for effective classification of the risk types of human papillomavirus (hpv) Discovery and analysis of inflammatory disease-related genes using cdna microarrays Possibility of using dna chip technology for diagnosis of human papillomavirus Classification of multiple cancer types by multicategory support vector machines using gene expression data Molecular detection and identification of influenza viruses by oligonucleotide microarray hybridization Dna-chip technology and infectious diseases Microarray-based detection and genotyping of viral pathogens Selection of optimal dna oligos for gene expression arrays Probe selection algorithms with applications in the analysis of microbial communities Fast large scale oligonucleotide selection using the longest common factor approach Optimal robust non-unique probe selection using integer linear programming An implementation of logical analysis of data On the complexity of polyhedral separability Milp approach to pattern generation in logical analysis of data Basic local alignment search tool Selection of oligonucleotide probes for protein coding sequences Support vector networks Pattern Recognition Techniques. Crane Statistical Learning Theory Partially defined boolean functions and cause-effect relationships A heuristic method for the set covering problem Integer and Combinatorial Optimization for the International Agency for Research on Cancer Multicenter Cervical Cancer Study Group: Epidemiologic classification of human papillomavirus types associated with cervical cancer The causal relation between human papillomavirus and cervical cancer The role of human papillomavirus in screening for cervical cancer Classification of the risk types of human papillomavirus by decision trees Unique and conserved features of genome and proteome of sars-coronavirus, an early split-off from the coronavirus group 2 lineage Transmission of h7n7 avian influenza a virus to human beings during a large outbreak in commercial poultry farms in the netherlands